fla-org / native-sparse-attention
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
⭐ 964 · Updated this week
Alternatives and similar repositories for native-sparse-attention
Users interested in native-sparse-attention are comparing it to the libraries listed below.
- Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper · ⭐ 797 · Updated 5 months ago
- Ring attention implementation with flash attention · ⭐ 979 · Updated 4 months ago
- Muon is Scalable for LLM Training · ⭐ 1,426 · Updated 6 months ago
- TransMLA: Multi-Head Latent Attention Is All You Need (NeurIPS 2025 Spotlight) · ⭐ 429 · Updated 4 months ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference · ⭐ 641 · Updated 3 weeks ago
- Helpful tools and examples for working with flex-attention · ⭐ 1,118 · Updated 3 weeks ago
- Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding" · ⭐ 825 · Updated last week
- 🔥 A minimal training framework for scaling FLA models · ⭐ 343 · Updated 2 months ago
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule · ⭐ 452 · Updated 4 months ago
- Efficient Triton implementation of Native Sparse Attention · ⭐ 262 · Updated 8 months ago
- A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training · ⭐ 631 · Updated this week
- VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo · ⭐ 1,620 · Updated this week
- (no description) · ⭐ 579 · Updated 4 months ago
- Parallel Scaling Law for Language Models: Beyond Parameter and Inference Time Scaling · ⭐ 468 · Updated 8 months ago
- Accelerating MoE with IO and Tile-aware Optimizations · ⭐ 569 · Updated 3 weeks ago
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads · ⭐ 524 · Updated 11 months ago
- Efficient LLM Inference over Long Sequences · ⭐ 394 · Updated 7 months ago
- Training library for Megatron-based models with bidirectional Hugging Face conversion capability · ⭐ 419 · Updated this week
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs · ⭐ 204 · Updated 2 months ago
- Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime · ⭐ 830 · Updated this week
- LLM KV cache compression made easy · ⭐ 876 · Updated last week
- Scalable toolkit for efficient model reinforcement · ⭐ 1,293 · Updated this week
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving · ⭐ 676 · Updated this week
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models · ⭐ 340 · Updated 11 months ago
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗) · ⭐ 658 · Updated 4 months ago
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch · ⭐ 549 · Updated 8 months ago
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLMs' inference, approximate and dynamic sparse calculation of the attention… · ⭐ 1,180 · Updated 4 months ago
- (no description) · ⭐ 449 · Updated 5 months ago
- A sparse attention kernel supporting mixed sparse patterns · ⭐ 453 · Updated 3 weeks ago
- [NeurIPS 2025 Spotlight] TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425) · ⭐ 445 · Updated 2 weeks ago