fla-org / native-sparse-attention
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
☆856 · Updated 6 months ago
Alternatives and similar repositories for native-sparse-attention
Users interested in native-sparse-attention are comparing it to the libraries listed below.
- Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper ☆744 · Updated last month
- Ring attention implementation with flash attention ☆866 · Updated last week
- Muon is Scalable for LLM Training ☆1,311 · Updated last month
- TransMLA: Multi-Head Latent Attention Is All You Need ☆356 · Updated 2 weeks ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ☆562 · Updated last week
- Helpful tools and examples for working with flex-attention (see the sketch after this list) ☆970 · Updated last week
- VeOmni: Scaling any Modality Model Training to any Accelerators with PyTorch native Training Framework ☆1,087 · Updated 3 weeks ago
- 🔥 A minimal training framework for scaling FLA models ☆239 · Updated last week
- Efficient Triton implementation of Native Sparse Attention ☆218 · Updated 3 months ago
- Parallel Scaling Law for Language Models — Beyond Parameter and Inference Time Scaling ☆443 · Updated 4 months ago
- slime is an LLM post-training framework for RL Scaling ☆1,747 · Updated last week
- Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding" ☆453 · Updated last week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆490 · Updated 7 months ago
- A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training ☆508 · Updated this week
- ☆423 · Updated last month
- Scalable toolkit for efficient model reinforcement ☆857 · Updated this week
- Muon is an optimizer for hidden layers in neural networks ☆1,710 · Updated 2 months ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆188 · Updated 2 months ago
- Efficient LLM Inference over Long Sequences ☆391 · Updated 2 months ago
- ☆638 · Updated last week
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving ☆374 · Updated last week
- LLM KV cache compression made easy ☆609 · Updated this week
- ☆519 · Updated last month
- ☆201 · Updated 5 months ago
- A sparse attention kernel supporting mixed sparse patterns ☆296 · Updated 7 months ago
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆328 · Updated 6 months ago
- ☆814 · Updated 3 months ago
- Large Context Attention ☆736 · Updated 7 months ago
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch ☆537 · Updated 4 months ago
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training ☆239 · Updated last month
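For a sense of the kind of block-sparse masking the flex-attention entry above refers to, here is a minimal sketch using PyTorch's `flex_attention` API (assumes torch >= 2.5 and a CUDA device). The `sliding_window_causal` helper, window size, and tensor shapes are arbitrary illustration choices, not code taken from any repository listed here.

```python
# Minimal sketch: causal attention restricted to a sliding window, expressed as a
# block mask so fully masked tiles are skipped by the kernel.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 1, 8, 2048, 64   # batch, heads, sequence length, head dim (illustrative)
WINDOW = 256                  # assumed sliding-window width

def sliding_window_causal(b, h, q_idx, kv_idx):
    # Keep a key only if it is not in the future and within WINDOW tokens of the query.
    return (q_idx >= kv_idx) & (q_idx - kv_idx <= WINDOW)

# Precompute which (query block, key block) tiles contain any unmasked entries.
block_mask = create_block_mask(sliding_window_causal, B=B, H=H, Q_LEN=S, KV_LEN=S)

q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))
out = flex_attention(q, k, v, block_mask=block_mask)
print(out.shape)  # torch.Size([1, 8, 2048, 64])
```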