fxmeng / TransMLA
TransMLA: Multi-Head Latent Attention Is All You Need
☆231 Updated last month
Alternatives and similar repositories for TransMLA:
Users interested in TransMLA are comparing it to the libraries listed below.
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆158 Updated this week
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆278 Updated last month
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training ☆182 Updated last week
- Efficient triton implementation of Native Sparse Attention. ☆135 Updated last week
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆157 Updated 9 months ago
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ☆621 Updated 3 weeks ago
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆451 Updated 2 months ago
- DeepSeek Native Sparse Attention pytorch implementation ☆60 Updated last month
- 🔥 A minimal training framework for scaling FLA models ☆101 Updated this week
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ☆473 Updated this week
- [NeurIPS 24 Spotlight] MaskLLM: Learnable Semi-structured Sparsity for Large Language Models ☆161 Updated 3 months ago
- ☆191 Updated 5 months ago
- The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression> ☆123 Updated 4 months ago
- qwen-nsa ☆49 Updated last week
- ☆151 Updated this week
- OpenSeek aims to unite the global open source community to drive collaborative innovation in algorithms, data and systems to develop next… ☆131 Updated last week
- Chain of Experts (CoE) enables communication between experts within Mixture-of-Experts (MoE) models ☆155 Updated last month
- Awesome list for LLM pruning. ☆222 Updated 4 months ago
- [ICML'24] The official implementation of “Rethinking Optimization and Architecture for Tiny Language Models” ☆121 Updated 3 months ago
- (Unofficial) PyTorch implementation of grouped-query attention (GQA) from "GQA: Training Generalized Multi-Query Transformer Models from … ☆160 Updated 11 months ago
- XAttention: Block Sparse Attention with Antidiagonal Scoring ☆137 Updated 3 weeks ago
- The official implementation of Tensor ProducT ATTenTion Transformer (T6) ☆359 Updated this week
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ☆252 Updated 3 weeks ago
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs ☆97 Updated last week
- ☆235 Updated 11 months ago
- Official codebase for "Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling". ☆247 Updated last month
- [ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models" ☆402 Updated 6 months ago
- Awesome LLM pruning papers: an all-in-one repository integrating all useful resources and insights. ☆83 Updated 4 months ago
- [NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models ☆212 Updated 2 weeks ago
- The official GitHub page for the survey paper "A Survey on Mixture of Experts in Large Language Models". ☆327 Updated last month