fla-org / native-sparse-attention
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
☆833 · Updated 5 months ago
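For context on what this repository implements: the NSA paper attends with three branches per query (compressed-token attention, selected-block attention, and a sliding window) and mixes their outputs with learned sigmoid gates. Below is a minimal PyTorch sketch of that gating step only; it is not this repository's actual Triton API, and the class name, shapes, and helper names are hypothetical.

```python
import torch
import torch.nn as nn

class NSAGate(nn.Module):
    """Sketch of NSA-style branch gating: each of the three branch outputs
    (compression, selection, sliding window) is scaled by a learned sigmoid
    gate computed from the query features, then summed."""
    def __init__(self, dim: int, num_branches: int = 3):
        super().__init__()
        self.proj = nn.Linear(dim, num_branches)

    def forward(self, q: torch.Tensor, branch_outputs: list) -> torch.Tensor:
        # q: (batch, seq, dim); each branch output: (batch, seq, dim)
        gates = torch.sigmoid(self.proj(q))      # (batch, seq, num_branches)
        return sum(g.unsqueeze(-1) * out         # broadcast gate over dim
                   for g, out in zip(gates.unbind(-1), branch_outputs))

# Toy usage: random tensors stand in for the real branch attention kernels.
B, T, D = 2, 16, 64
q = torch.randn(B, T, D)
branches = [torch.randn(B, T, D) for _ in range(3)]
out = NSAGate(D)(q, branches)                    # (2, 16, 64)
```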
Alternatives and similar repositories for native-sparse-attention
Users interested in native-sparse-attention are comparing it to the libraries listed below
- Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper ☆730 · Updated 2 weeks ago
- Ring attention implementation with flash attention ☆849 · Updated 3 weeks ago
- Muon is Scalable for LLM Training ☆1,289 · Updated 3 weeks ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference ☆550 · Updated last month
- TransMLA: Multi-Head Latent Attention Is All You Need ☆343 · Updated last month
- Parallel Scaling Law for Language Models: Beyond Parameter and Inference Time Scaling ☆432 · Updated 3 months ago
- Helpful tools and examples for working with flex-attention ☆943 · Updated last week
- VeOmni: Scaling Any-Modality Model Training to Any Accelerator with a PyTorch-Native Training Framework ☆975 · Updated this week
- slime is an LLM post-training framework aimed at RL scaling. ☆1,420 · Updated this week
- Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding" ☆386 · Updated this week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆487 · Updated 6 months ago
- 🔥 A minimal training framework for scaling FLA models ☆233 · Updated last week
- A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training ☆493 · Updated this week
- Efficient Triton implementation of Native Sparse Attention. ☆209 · Updated 3 months ago
- Muon is an optimizer for hidden layers in neural networks ☆1,595 · Updated last month
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆189 · Updated 2 months ago
- Scalable toolkit for efficient model reinforcement ☆796 · Updated this week
- LLM KV cache compression made easy ☆596 · Updated this week
- Efficient LLM Inference over Long Sequences ☆390 · Updated 2 months ago
- Large Context Attention ☆729 · Updated 7 months ago
- ☆811 · Updated 2 months ago
- ☆408 · Updated 2 weeks ago
- The official implementation of TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425) ☆381 · Updated this week
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch ☆536 · Updated 3 months ago
- ☆516 · Updated last month
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3. ☆1,490 · Updated last week
- MoBA: Mixture of Block Attention for Long-Context LLMs ☆1,878 · Updated 4 months ago
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLMs' inference, approximate and dynamic sparse calculation of the attention…