fla-org / native-sparse-attention
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
☆653 · Updated last month
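For orientation, the sketch below shows the gated three-branch structure the NSA paper describes (attention over compressed key/value blocks, over a top-k selection of blocks, and over a local sliding window, mixed by per-token gates). It is a minimal PyTorch reconstruction under simplifying assumptions, not this repo's Triton kernels: causal masking is omitted from the first two branches, selected blocks are handled by dense masking rather than gathering, and every function name, default value, and the gate shape are hypothetical.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int):
    # Plain attention restricted to the most recent `window` keys per query (causal + local).
    scale = q.shape[-1] ** 0.5
    T = q.shape[-2]
    idx = torch.arange(T, device=q.device)
    local = (idx[:, None] >= idx[None, :]) & (idx[:, None] - idx[None, :] < window)
    scores = (q @ k.transpose(-1, -2) / scale).masked_fill(~local, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def nsa_block_sketch(q, k, v, gates, block: int = 64, topk: int = 4, window: int = 256):
    # q, k, v: (B, H, T, D); gates: (B, H, T, 3), e.g. sigmoid outputs of a small MLP.
    B, H, T, D = q.shape
    nblk = T // block                       # tail tokens beyond nblk*block are ignored here
    k_blk = k[..., : nblk * block, :].reshape(B, H, nblk, block, D).mean(dim=-2)
    v_blk = v[..., : nblk * block, :].reshape(B, H, nblk, block, D).mean(dim=-2)

    # Branch 1: "compression" -- queries attend to mean-pooled block summaries.
    cmp_scores = q @ k_blk.transpose(-1, -2) / D ** 0.5          # (B, H, T, nblk)
    out_cmp = F.softmax(cmp_scores, dim=-1) @ v_blk

    # Branch 2: "selection" -- keep the top-k blocks per query (scored via branch 1),
    # then attend to the raw tokens inside those blocks (dense masking for readability).
    sel = cmp_scores.topk(min(topk, nblk), dim=-1).indices
    keep_blk = torch.zeros_like(cmp_scores, dtype=torch.bool).scatter_(-1, sel, True)
    keep_tok = keep_blk.repeat_interleave(block, dim=-1)         # block mask -> token mask
    sel_scores = q @ k[..., : nblk * block, :].transpose(-1, -2) / D ** 0.5
    sel_scores = sel_scores.masked_fill(~keep_tok, float("-inf"))
    out_sel = F.softmax(sel_scores, dim=-1) @ v[..., : nblk * block, :]

    # Branch 3: local sliding-window attention over recent tokens.
    out_win = sliding_window_attention(q, k, v, window)

    g = gates.unsqueeze(-1)                                      # (B, H, T, 3, 1)
    return g[..., 0, :] * out_cmp + g[..., 1, :] * out_sel + g[..., 2, :] * out_win

if __name__ == "__main__":
    B, H, T, D = 1, 4, 512, 64
    q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
    gates = torch.sigmoid(torch.randn(B, H, T, 3))               # stand-in for a learned gate
    print(nsa_block_sketch(q, k, v, gates).shape)                # torch.Size([1, 4, 512, 64])
```

Per the paper, the hardware alignment comes from selecting contiguous blocks and sharing the selection across grouped query heads; the dense masked version above only illustrates the dataflow.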
Alternatives and similar repositories for native-sparse-attention
Users interested in native-sparse-attention are comparing it to the libraries listed below.
- Ring attention implementation with flash attention · ☆764 · Updated last month
- Implementation of the sparse attention pattern proposed by the Deepseek team in their "Native Sparse Attention" paper · ☆613 · Updated last month
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference · ☆492 · Updated 3 weeks ago
- Muon is Scalable for LLM Training · ☆1,043 · Updated last month
- Muon optimizer: >30% sample efficiency with <3% wallclock overhead · ☆623 · Updated last month
- TransMLA: Multi-Head Latent Attention Is All You Need · ☆247 · Updated this week
- Efficient LLM Inference over Long Sequences · ☆373 · Updated 2 weeks ago
- Helpful tools and examples for working with flex-attention (see the flex_attention sketch after this list) · ☆766 · Updated last week
- ☆746 · Updated 3 weeks ago
- VeOmni: Scaling any Modality Model Training to any Accelerators with PyTorch native Training Framework · ☆308 · Updated last month
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads · ☆459 · Updated 3 months ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs · ☆164 · Updated last week
- Super-Efficient RLHF Training of LLMs with Parameter Reallocation · ☆292 · Updated 3 weeks ago
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch · ☆512 · Updated 6 months ago
- A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training · ☆338 · Updated this week
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training · ☆190 · Updated 3 weeks ago
- Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA · ☆828 · Updated this week
- Distributed Triton for Parallel Systems · ☆677 · Updated last week
- Large Context Attention · ☆710 · Updated 3 months ago
- LLM KV cache compression made easy · ☆476 · Updated last week
- 🚀 Efficient implementations of state-of-the-art linear attention models in Torch and Triton · ☆2,380 · Updated this week
- ☆183 · Updated last month
- Official Repo for Open-Reasoner-Zero · ☆1,912 · Updated last month
- Microsoft Automatic Mixed Precision Library · ☆595 · Updated 7 months ago
- Understanding R1-Zero-Like Training: A Critical Perspective · ☆915 · Updated last month
- MoBA: Mixture of Block Attention for Long-Context LLMs · ☆1,771 · Updated last month
- Fast inference from large language models via speculative decoding · ☆723 · Updated 8 months ago
- Efficient Triton implementation of Native Sparse Attention · ☆144 · Updated last month
- ByteCheckpoint: A Unified Checkpointing Library for LFMs · ☆207 · Updated last month
- 🔥 A minimal training framework for scaling FLA models · ☆128 · Updated last week
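The flex-attention entry above refers to tooling built around PyTorch's `torch.nn.attention.flex_attention` API. As a minimal illustration of the kind of pattern that API can express, here is a sliding-window causal mask; this is a generic usage sketch (it assumes PyTorch 2.5+ on a CUDA device, and the shapes and window size are arbitrary), not code taken from that repository.

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

WINDOW = 128  # arbitrary local window size

def sliding_window_causal(b, h, q_idx, kv_idx):
    # Each query attends to keys at or before its own position, within the last WINDOW tokens.
    return (q_idx >= kv_idx) & (q_idx - kv_idx < WINDOW)

B, H, S, D = 2, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

# The block mask lets the kernel skip tiles that are fully masked out;
# in practice flex_attention is usually wrapped in torch.compile for speed.
block_mask = create_block_mask(sliding_window_causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")
out = flex_attention(q, k, v, block_mask=block_mask)   # (B, H, S, D)
```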