JT-Ushio / MHA2MLA
Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
☆176Updated last week
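For context, the core idea MHA2MLA enables — replacing full per-head key/value caches with a small shared latent that is decompressed at attention time, as in DeepSeek's Multi-Head Latent Attention — can be sketched roughly as below. This is only a minimal illustration under simplifying assumptions (made-up dimension sizes, no RoPE, no decoupled positional path, no proper masking for cached decoding), not the repository's actual implementation.

```python
# Minimal sketch of MLA-style latent KV compression (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentKVAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # Keys and values are derived from one shared low-rank latent per token.
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)  # compress
        self.k_up = nn.Linear(d_latent, d_model, bias=False)     # decompress K
        self.v_up = nn.Linear(d_latent, d_model, bias=False)     # decompress V
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        # Only the d_latent-sized vector per token needs caching,
        # instead of full per-head K and V tensors.
        latent = self.kv_down(x)
        if latent_cache is not None:
            latent = torch.cat([latent_cache, latent], dim=1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        # Proper masking for the cached/decoding path is omitted for brevity.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=latent_cache is None)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out), latent  # latent doubles as the (small) KV cache


if __name__ == "__main__":
    attn = LatentKVAttention()
    x = torch.randn(2, 16, 512)
    y, cache = attn(x)
    print(y.shape, cache.shape)  # torch.Size([2, 16, 512]) torch.Size([2, 16, 64])
```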
Alternatives and similar repositories for MHA2MLA
Users interested in MHA2MLA are comparing it to the libraries listed below.
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models☆133Updated last year
- TransMLA: Multi-Head Latent Attention Is All You Need☆302Updated this week
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training☆209Updated this week
- Parallel Scaling Law for Language Models — Beyond Parameter and Inference Time Scaling☆395Updated last month
- Efficient triton implementation of Native Sparse Attention.☆167Updated last month
- Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding"☆233Updated last week
- Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme☆130Updated 2 months ago
- slime is an LLM post-training framework aimed at scaling RL.☆328Updated this week
- ☆190Updated 2 months ago
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads☆466Updated 4 months ago
- 🔥 A minimal training framework for scaling FLA models☆170Updated last week
- ☆77Updated 2 months ago
- Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens"☆137Updated 11 months ago
- An Open Math Pre-training Dataset with 370B Tokens.☆89Updated 2 months ago
- Code for the paper [ICLR 2025 Oral] "FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference"☆112Updated last month
- ☆202Updated 4 months ago
- The official repo of One RL to See Them All: Visual Triple Unified Reinforcement Learning☆263Updated 3 weeks ago
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"☆163Updated last year
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning☆220Updated last month
- A Comprehensive Survey on Long Context Language Modeling☆151Updated 2 weeks ago
- [ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models"☆410Updated 8 months ago
- A simple extension on vLLM to help you speed up reasoning models without training.☆161Updated 3 weeks ago
- Chain of Experts (CoE) enables communication between experts within Mixture-of-Experts (MoE) models (see the sketch after this list)☆169Updated last month
- ☆152Updated last month
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs☆126Updated last week
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning☆186Updated 3 months ago
- Efficient LLM Inference over Long Sequences☆377Updated 2 weeks ago
- [ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale☆251Updated 2 weeks ago
- General Reasoner: Advancing LLM Reasoning Across All Domains☆141Updated last week
- VeOmni: Scaling Any-Modality Model Training to Any Accelerator with a PyTorch-Native Training Framework☆353Updated last month
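As referenced in the Chain of Experts (CoE) entry above, the rough idea — routing tokens through a short chain of experts so later experts can condition on earlier expert outputs, instead of mixing all expert outputs in one parallel step — can be sketched as follows. Expert sizes, chain length, top-k routing, and the residual update are illustrative assumptions, not the paper's exact formulation, and the dense loop over all experts is for clarity rather than efficiency.

```python
# Rough sketch of chained expert routing in an MoE layer (illustrative only).
import torch
import torch.nn as nn


class ChainOfExperts(nn.Module):
    def __init__(self, d_model=256, n_experts=8, chain_len=2, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        # One router per chain step, so routing can adapt to intermediate states.
        self.routers = nn.ModuleList([nn.Linear(d_model, n_experts)
                                      for _ in range(chain_len)])
        self.top_k = top_k

    def forward(self, x):
        h = x
        for router in self.routers:
            logits = router(h)                                    # (..., n_experts)
            weights, idx = logits.softmax(-1).topk(self.top_k, dim=-1)
            step = torch.zeros_like(h)
            # Dense evaluation of every expert, masked per token (clarity over speed).
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = (idx[..., slot] == e).unsqueeze(-1)
                    step = step + mask * weights[..., slot:slot + 1] * expert(h)
            h = h + step  # the next chain step routes on this updated state


if __name__ == "__main__":
    moe = ChainOfExperts()
    out = moe(torch.randn(4, 10, 256))
```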