gpu-mode / ring-attention
ring-attention experiments
☆96 · Updated 3 weeks ago
Related projects
Alternatives and complementary repositories for ring-attention
- ☆144 · Updated this week
- Cataloging released Triton kernels. ☆133 · Updated 2 months ago
- Applied AI experiments and examples for PyTorch ☆160 · Updated last week
- Simple and fast low-bit matmul kernels in CUDA / Triton ☆137 · Updated this week
- Collection of kernels written in Triton language ☆63 · Updated 2 weeks ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆191 · Updated 3 weeks ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆51 · Updated 2 months ago
- This repository contains the experimental PyTorch native float8 training UX ☆211 · Updated 3 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆184 · Updated last month
- Triton-based implementation of Sparse Mixture of Experts. ☆184 · Updated last month
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆163 · Updated this week
- Odysseus: Playground of LLM Sequence Parallelism ☆55 · Updated 4 months ago
- ☆96 · Updated last month
- ☆88 · Updated 2 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆46 · Updated this week
- Extensible collectives library in Triton ☆63 · Updated last month
- Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers ☆196 · Updated 2 months ago
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆66 · Updated 5 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆146 · Updated 4 months ago
- ☆72 · Updated 4 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆51 · Updated last week
- ☆133 · Updated 9 months ago
- KV cache compression for high-throughput LLM inference ☆82 · Updated last week
- ☆95 · Updated last month
- ☆46 · Updated last month
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆195 · Updated last week
- ☆55 · Updated 5 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆104 · Updated last month
- ☆156 · Updated last year
- Fast Inference of MoE Models with CPU-GPU Orchestration ☆170 · Updated last week