fla-org / native-sparse-attention
Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
⭐ 636 · Updated last month
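This page only lists the repository and its neighbors, not its interface. For orientation, the sketch below shows, in plain PyTorch, the gated three-branch combination the NSA paper describes (compressed, selected, and sliding-window attention). The function name `nsa_combine`, the tensor layout, and the sigmoid gating are illustrative assumptions, not the library's actual Triton API.

```python
import torch

def nsa_combine(o_cmp, o_slc, o_swa, gate_logits):
    """Illustrative sketch: mix NSA's three branch outputs (compressed,
    selected, sliding-window attention) with per-token, per-head gates.
    Tensor shapes are an assumption: (batch, seq, heads, head_dim)."""
    g = torch.sigmoid(gate_logits)        # (batch, seq, heads, 3)
    return (g[..., 0:1] * o_cmp           # compressed-attention branch
            + g[..., 1:2] * o_slc         # selected-attention branch
            + g[..., 2:3] * o_swa)        # sliding-window branch

if __name__ == "__main__":
    B, T, H, D = 2, 128, 4, 64
    o_cmp, o_slc, o_swa = (torch.randn(B, T, H, D) for _ in range(3))
    gate_logits = torch.randn(B, T, H, 3)
    out = nsa_combine(o_cmp, o_slc, o_swa, gate_logits)
    print(out.shape)  # torch.Size([2, 128, 4, 64])
```

The heavy lifting in the actual repository sits in the Triton kernels that compute the branch outputs; the gated mix above is only the cheap final step described in the paper.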
Alternatives and similar repositories for native-sparse-attention:
Users interested in native-sparse-attention are comparing it to the libraries listed below.
- Ring attention implementation with flash attention (⭐ 743 · updated 2 weeks ago)
- Muon is Scalable for LLM Training (⭐ 1,029 · updated 3 weeks ago)
- Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper (⭐ 601 · updated 3 weeks ago)
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference (⭐ 477 · updated this week)
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads (⭐ 453 · updated 2 months ago)
- Helpful tools and examples for working with flex-attention (⭐ 726 · updated last week)
- Efficient LLM Inference over Long Sequences (⭐ 368 · updated this week)
- Understanding R1-Zero-Like Training: A Critical Perspective (⭐ 882 · updated last week)
- Muon optimizer: +>30% sample efficiency with <3% wallclock overhead (⭐ 577 · updated last month)
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs (⭐ 161 · updated this week)
- TransMLA: Multi-Head Latent Attention Is All You Need (⭐ 238 · updated last month)
- VeOmni: Scaling any Modality Model Training to any Accelerators with PyTorch native Training Framework (⭐ 297 · updated 2 weeks ago)
- LLM KV cache compression made easy (⭐ 458 · updated last week)
- Efficient Triton implementation of Native Sparse Attention (⭐ 139 · updated 2 weeks ago)
- Implementation of Ring Attention, from Liu et al. at Berkeley AI, in Pytorch (⭐ 511 · updated 5 months ago)
- [ICML 2024] CLLMs: Consistency Large Language Models (⭐ 391 · updated 5 months ago)
- Must-read papers on KV Cache Compression (constantly updating) (⭐ 376 · updated 2 weeks ago)
- Super-Efficient RLHF Training of LLMs with Parameter Reallocation (⭐ 281 · updated 3 months ago)
- Distributed Triton for Parallel Systems (⭐ 451 · updated 2 weeks ago)
- A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training (⭐ 248 · updated this week)
- A sparse attention kernel supporting mixed sparse patterns (⭐ 197 · updated 2 months ago)
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training (⭐ 184 · updated last week)
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models (⭐ 279 · updated 2 months ago)
- Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware (⭐ 715 · updated 6 months ago)
- Large Context Attention (⭐ 704 · updated 3 months ago)
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… (⭐ 317 · updated 4 months ago)
- DeepSeek Native Sparse Attention pytorch implementation (⭐ 62 · updated last month)