MuLabPKU / TransMLA

TransMLA: Multi-Head Latent Attention Is All You Need (NeurIPS 2025 Spotlight)

☆422

Alternatives and similar repositories for TransMLA

Users who are interested in TransMLA are comparing it to the repositories listed below.

JT-Ushio / MHA2MLA
Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
☆198 Updated 3 weeks ago
OpenNLPLab / lightning-attention
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
☆338 Updated 10 months ago
mit-han-lab / duo-attention
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
☆515 Updated 10 months ago
tensorgi / TPA
[NeurIPS 2025 Spotlight] TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425)
☆438 Updated 2 weeks ago
dhcode-cpp / NSA-pytorch
DeepSeek Native Sparse Attention pytorch implementation
☆110 Updated 2 weeks ago
fla-org / native-sparse-attention
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
☆948 Updated 9 months ago
NVlabs / Fast-dLLM
Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding"
☆758 Updated last month
fla-org / flame
🔥 A minimal training framework for scaling FLA models
☆324 Updated last month
QwenLM / ParScale
Parallel Scaling Law for Language Model — Beyond Parameter and Inference Time Scaling
☆465 Updated 7 months ago
step-law / steplaw
☆208 Updated 2 months ago
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆394 Updated 6 months ago
weigao266 / Awesome-Efficient-Arch
Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
☆380 Updated last month
NVlabs / COAT
[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training
☆256 Updated 4 months ago
yaof20 / Flash-RL
Implementation of FP8/INT8 rollout for RL training without performance drop.
☆282 Updated last month
microsoft / SeerAttention
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
☆181 Updated 3 months ago
astramind-ai / Mixture-of-depths
Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
☆176 Updated last year
RyanLiu112 / compute-optimal-tts
Official codebase for "Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling".
☆278 Updated 10 months ago
NVlabs / Minitron
A family of compressed models obtained via pruning and knowledge distillation
☆361 Updated last month
XunhaoLai / native-sparse-attention-triton
Efficient Triton implementation of Native Sparse Attention.
☆257 Updated 7 months ago
NVlabs / GatedDeltaNet
[ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule
☆411 Updated 3 months ago
mdy666 / Qwen-Native-Sparse-Attention
qwen-nsa
☆86 Updated 2 months ago
pprp / Awesome-Efficient-MoE
Efficient Mixture of Experts for LLM Paper List
☆153 Updated 3 months ago
ZihanWang314 / CoE
Chain of Experts (CoE) enables communication between experts within Mixture-of-Experts (MoE) models
☆227 Updated last month
shreyansh26 / FlashAttention-PyTorch
Implementation of FlashAttention in PyTorch
☆178 Updated 11 months ago
thu-ml / ReMoE
[ICLR 2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM.
☆104 Updated last year
OpenSparseLLMs / Linear-MoE
☆126 Updated 6 months ago
lucidrains / native-sparse-attention-pytorch
Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper
☆791 Updated 4 months ago
mit-han-lab / x-attention
[ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring
☆262 Updated 5 months ago
qingkelab / qingketalk
青稞Talk
☆180 Updated 3 weeks ago
NVlabs / MaskLLM
[NeurIPS 24 Spotlight] MaskLLM: Learnable Semi-structured Sparsity for Large Language Models
☆183 Updated last year