fxmeng / TransMLA
TransMLA: Multi-Head Latent Attention Is All You Need
⭐284 · Updated this week
Alternatives and similar repositories for TransMLA
Users interested in TransMLA are comparing it to the libraries listed below.
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ⭐686 · Updated 2 months ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ⭐166 · Updated last week
- Parallel Scaling Law for Language Model – Beyond Parameter and Inference Time Scaling ⭐367 · Updated 2 weeks ago
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ⭐294 · Updated 3 months ago
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ⭐463 · Updated 3 months ago
- DeepSeek Native Sparse Attention pytorch implementation ⭐70 · Updated 3 months ago
- Efficient triton implementation of Native Sparse Attention. ⭐155 · Updated last week
- Efficient LLM Inference over Long Sequences ⭐376 · Updated this week
- The official implementation of Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425) ⭐373 · Updated 2 weeks ago
- ⭐188 · Updated last month
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs ⭐102 · Updated this week
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training ⭐203 · Updated 2 weeks ago
- [NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models ⭐221 · Updated last month
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ⭐270 · Updated last week
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ⭐506 · Updated last week
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ⭐160 · Updated 11 months ago
- Implementation of the sparse attention pattern proposed by the Deepseek team in their "Native Sparse Attention" paper ⭐637 · Updated 2 weeks ago
- Official codebase for "Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling". ⭐261 · Updated 3 months ago
- 🔥 A minimal training framework for scaling FLA models ⭐146 · Updated 3 weeks ago
- Super-Efficient RLHF Training of LLMs with Parameter Reallocation ⭐299 · Updated last month
- Ring attention implementation with flash attention ⭐771 · Updated last week
- Muon optimizer: +>30% sample efficiency with <3% wallclock overhead ⭐661 · Updated last week
- The official GitHub page for the survey paper "A Survey on Mixture of Experts in Large Language Models". ⭐361 · Updated 2 months ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… ⭐333 · Updated 5 months ago
- Chain of Experts (CoE) enables communication between experts within Mixture-of-Experts (MoE) models ⭐167 · Updated 2 weeks ago
- A family of compressed models obtained via pruning and knowledge distillation ⭐341 · Updated 6 months ago
- The homepage of OneBit model quantization framework. ⭐180 · Updated 3 months ago
- PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models (NeurIPS 2024 Spotlight) ⭐356 · Updated 4 months ago
- The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression> ⭐128 · Updated last week
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ⭐237 · Updated 4 months ago