JT-Ushio / MHA2MLALinks

Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs

☆196

Alternatives and similar repositories for MHA2MLA

Users that are interested in MHA2MLA are comparing it to the libraries listed below

Sorting:

QwenLM / ParScale
Parallel Scaling Law for Language Model — Beyond Parameter and Inference Time Scaling
☆456Updated 6 months ago
MuLabPKU / TransMLA
TransMLA: Multi-Head Latent Attention Is All You Need (NeurIPS 2025 Spotlight)
☆413Updated 2 months ago
step-law / steplaw
☆207Updated last month
NVlabs / COAT
[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training
☆250Updated 3 months ago
mit-han-lab / duo-attention
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
☆507Updated 9 months ago
SkyworkAI / Skywork-MoE
Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
☆137Updated last year
yaof20 / Flash-RL
Implementation for FP8/INT8 Rollout for RL training without performence drop.
☆275Updated 3 weeks ago
stepfun-ai / Step3
☆439Updated 3 months ago
bigai-nlco / TokenSwift
[ICML 2025] |TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation
☆118Updated 6 months ago
OpenNLPLab / lightning-attention
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
☆335Updated 9 months ago
XunhaoLai / native-sparse-attention-triton
Efficient triton implementation of Native Sparse Attention.
☆251Updated 6 months ago
thu-ml / ReMoE
[ICLR2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM.
☆99Updated 11 months ago
OpenSparseLLMs / Linear-MoE
☆120Updated 5 months ago
ZihanWang314 / CoE
Chain of Experts (CoE) enables communication between experts within Mixture-of-Experts (MoE) models
☆223Updated 3 weeks ago
weigao266 / Awesome-Efficient-Arch
Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
☆368Updated 3 weeks ago
microsoft / SeerAttention
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
☆175Updated 2 months ago
Tencent / AngelSlim
Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency.
☆212Updated this week
hao-ai-lab / Dynasor
[NeurIPS 2025] Simple extension on vLLM to help you speed up reasoning model without training.
☆209Updated 6 months ago
astramind-ai / Mixture-of-depths
Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
☆175Updated last year
NVlabs / Fast-dLLM
Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding"
☆713Updated this week
inclusionAI / Ling-V2
Ling-V2 is a MoE LLM provided and open-sourced by InclusionAI.
☆237Updated last month
NVIDIA-NeMo / Megatron-Bridge
HuggingFace conversion and training library for Megatron-based models
☆228Updated this week
jshuadvd / LongRoPE
Implementation of the LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens Paper
☆152Updated last year
RUC-GSAI / YuLan-Mini
A highly capable 2.4B lightweight LLM using only 1T pre-training data with all details.
☆222Updated 4 months ago
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆392Updated 5 months ago
Dao-AILab / grouped-latent-attention
☆132Updated 6 months ago
fla-org / flame
🔥 A minimal training framework for scaling FLA models
☆311Updated 2 weeks ago
FasterDecoding / SnapKV
☆290Updated 4 months ago
mdy666 / Qwen-Native-Sparse-Attention
qwen-nsa
☆84Updated last month
MiroMindAI / MiroRL
MiroRL is an MCP-first reinforcement learning framework for deep research agent.
☆180Updated 3 months ago