fxmeng / TransMLA
TransMLA: Multi-Head Latent Attention Is All You Need
☆243 · Updated this week
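For context on the technique the repository name refers to, here is a minimal, self-contained sketch of DeepSeek-V2-style multi-head latent attention (MLA). This is illustrative only and is not TransMLA's own code: the class name `LatentAttention`, all dimensions, and the omission of the decoupled RoPE path are simplifying assumptions.

```python
# Sketch of multi-head latent attention: keys and values are reconstructed from a
# small shared per-token latent, so only that latent needs to be cached at inference.
# Decoupled RoPE handling is omitted for brevity; dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentAttention(nn.Module):
    def __init__(self, d_model=1024, n_heads=16, head_dim=64, kv_latent_dim=128):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.q_proj = nn.Linear(d_model, n_heads * head_dim, bias=False)
        # Down-project hidden states into a shared low-rank KV latent (what gets cached).
        self.kv_down = nn.Linear(d_model, kv_latent_dim, bias=False)
        # Up-project the latent back into per-head keys and values.
        self.k_up = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
        self.v_up = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        latent = self.kv_down(x)  # (b, t, kv_latent_dim): the only tensor that must be cached
        k = self.k_up(latent).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(latent).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, self.n_heads * self.head_dim)
        return self.o_proj(out)
```

The memory win comes from caching only `latent` (`kv_latent_dim` values per token) rather than full per-head keys and values, which is the property several of the repositories listed below also target.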
Alternatives and similar repositories for TransMLA:
Users interested in TransMLA are comparing it to the repositories listed below.
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆163 · Updated this week
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆284 · Updated 2 months ago
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ☆647 · Updated last month
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆456 · Updated 2 months ago
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training ☆189 · Updated 2 weeks ago
- Efficient Triton implementation of Native Sparse Attention. ☆142 · Updated 3 weeks ago
- DeepSeek Native Sparse Attention PyTorch implementation ☆63 · Updated 2 months ago
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆158 · Updated 10 months ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ☆490 · Updated 2 weeks ago
- qwen-nsa ☆57 · Updated 3 weeks ago
- 🔥 A minimal training framework for scaling FLA models ☆119 · Updated this week
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs ☆97 · Updated last week
- Efficient LLM Inference over Long Sequences ☆372 · Updated last week
- A family of compressed models obtained via pruning and knowledge distillation ☆336 · Updated 5 months ago
- ☆183 · Updated 3 weeks ago
- Ring attention implementation with flash attention ☆759 · Updated last month
- Awesome list for LLM pruning. ☆224 · Updated 4 months ago
- Implementation of the paper: "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆91 · Updated this week
- ☆123 · Updated last week
- ☆194 · Updated 6 months ago
- (Unofficial) PyTorch implementation of grouped-query attention (GQA) from "GQA: Training Generalized Multi-Query Transformer Models from … ☆165 · Updated 11 months ago
- PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models (NeurIPS 2024 Spotlight) ☆348 · Updated 3 months ago
- ☆238 · Updated last year
- Official codebase for "Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling". ☆253 · Updated 2 months ago
- The official implementation of Tensor ProducT ATTenTion Transformer (T6) ☆367 · Updated 3 weeks ago
- XAttention: Block Sparse Attention with Antidiagonal Scoring ☆142 · Updated last month
- [NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models ☆215 · Updated this week
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning ☆175 · Updated last month
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ☆263 · Updated 2 weeks ago
- VeOmni: Scaling any Modality Model Training to any Accelerators with PyTorch native Training Framework ☆306 · Updated last month