fla-org / native-sparse-attention
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
☆601 · Updated last week
Alternatives and similar repositories for native-sparse-attention:
Users interested in native-sparse-attention are comparing it to the libraries listed below:
- Muon is Scalable for LLM Training ☆993 · Updated this week
- Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper ☆567 · Updated this week
- Ring attention implementation with flash attention ☆717 · Updated last month
- Muon optimizer: >30% sample efficiency with <3% wallclock overhead ☆529 · Updated this week
- Efficient LLM Inference over Long Sequences ☆365 · Updated last month
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ☆458 · Updated last week
- TransMLA: Multi-Head Latent Attention Is All You Need ☆221 · Updated 3 weeks ago
- Helpful tools and examples for working with flex-attention ☆698 · Updated last week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆442 · Updated last month
- Understanding R1-Zero-Like Training: A Critical Perspective ☆725 · Updated this week
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗) ☆355 · Updated this week
- Large Reasoning Models ☆800 · Updated 3 months ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆149 · Updated this week
- Official Repo for Open-Reasoner-Zero ☆1,687 · Updated 3 weeks ago
- LLM KV cache compression made easy ☆444 · Updated last week
- Efficient Triton implementation of Native Sparse Attention ☆127 · Updated this week
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… ☆311 · Updated 3 months ago
- 🚀 Efficient implementations of state-of-the-art linear attention models in Torch and Triton ☆2,177 · Updated this week
- [NeurIPS'24 Spotlight, ICLR'25] To speed up long-context LLMs' inference, approximate and dynamic sparse calculation of the attention, which r… ☆945 · Updated last month
- Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels ☆858 · Updated this week
- Code for Adam-mini: Use Fewer Learning Rates To Gain More (https://arxiv.org/abs/2406.16793) ☆399 · Updated 3 months ago
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch ☆506 · Updated 5 months ago
- Puzzles for learning Triton, playable with minimal environment configuration! ☆267 · Updated 3 months ago
- The official implementation of Tensor ProducT ATTenTion Transformer (T6) ☆345 · Updated last month
- Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models ☆439 · Updated this week
- [ICLR 2025] COAT: Compressing Optimizer States and Activations for Memory-Efficient FP8 Training ☆168 · Updated last month
- Super-Efficient RLHF Training of LLMs with Parameter Reallocation ☆242 · Updated 2 months ago
- An Open-source RL System from ByteDance Seed and Tsinghua AIR ☆915 · Updated this week
- [ICLR 2025 Spotlight 🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters ☆541 · Updated last month