ambisinister / mla-experiments
Experiments on Multi-Head Latent Attention
☆89 · Updated 8 months ago
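For context, a minimal sketch of the latent-KV compression idea behind Multi-Head Latent Attention (as introduced with DeepSeek-V2), assuming PyTorch. The module name, dimensions, and the omission of the decoupled RoPE path are illustrative simplifications, not the code in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    """Toy MLA-style attention: keys/values are reconstructed from a small shared latent."""
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # Hidden states are down-projected to a small shared latent c_kv;
        # only this latent would need to be cached at inference time.
        self.kv_down = nn.Linear(d_model, d_latent)
        # Per-head keys and values are up-projected from the latent.
        self.k_up = nn.Linear(d_latent, d_model)
        self.v_up = nn.Linear(d_latent, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        c_kv = self.kv_down(x)                  # (B, T, d_latent): the compressed KV "cache"
        k = self.k_up(c_kv).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(c_kv).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(o.transpose(1, 2).reshape(B, T, -1))

x = torch.randn(2, 16, 512)
print(LatentKVAttention()(x).shape)             # torch.Size([2, 16, 512])
```

The point of the low-rank latent is that the per-token cache shrinks from 2 · d_model values to d_latent values, at the cost of the extra up-projections.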
Alternatives and similar repositories for mla-experiments
Users interested in mla-experiments are comparing it to the libraries listed below.
- 🔥 A minimal training framework for scaling FLA models ☆128 · Updated last week
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆100 · Updated this week
- ☆128 · Updated 2 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆73 · Updated 8 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆60 · Updated 3 months ago
- Code for studying the super weight in LLM ☆100 · Updated 5 months ago
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆158 · Updated 10 months ago
- Fast and memory-efficient exact attention ☆68 · Updated 2 months ago
- An extension of the nanoGPT repository for training small MoE models. ☆140 · Updated 2 months ago
- Efficient Triton implementation of Native Sparse Attention. ☆144 · Updated last month
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection ☆108 · Updated 2 months ago
- ☆81 · Updated last year
- Low-bit optimizers for PyTorch ☆128 · Updated last year
- ☆146 · Updated last year
- An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆35 · Updated 11 months ago
- ☆71 · Updated 2 months ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆154 · Updated last month
- ☆103 · Updated 11 months ago
- EE-LLM is a framework for large-scale training and inference of early-exit (EE) large language models (LLMs). ☆61 · Updated 11 months ago
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ☆66 · Updated 6 months ago
- XAttention: Block Sparse Attention with Antidiagonal Scoring ☆146 · Updated last month
- The official repository of Quamba1 [ICLR 2025 🔥] & Quamba2 [ICML 2025 🔥] ☆45 · Updated last month
- [ICLR 2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM. ☆71 · Updated 4 months ago
- Load compute kernels from the Hub ☆116 · Updated this week
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆70 · Updated 11 months ago
- Transformers components but in Triton ☆33 · Updated this week
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆161 · Updated 10 months ago
- Implementation of the paper: "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆91 · Updated this week
- Official implementation of Phi-Mamba. A MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Mode…) ☆106 · Updated 8 months ago
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆185 · Updated 11 months ago