ambisinister / mla-experiments
Experiments on Multi-Head Latent Attention
☆87 · Updated 8 months ago
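For context, Multi-Head Latent Attention (MLA, introduced in DeepSeek-V2) compresses keys and values into a shared low-rank latent so that only the latent needs to be cached during decoding. The sketch below is a minimal, illustrative PyTorch version of that idea, not code from this repository; the module name `SimplifiedMLA`, the dimensions, and the omission of RoPE/decoupled positional handling are simplifying assumptions.

```python
# Minimal MLA-style sketch (illustrative only, not the mla-experiments code):
# keys/values are reconstructed from a shared low-rank latent, so a KV cache
# would only need to store the latent rather than full per-head keys/values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMLA(nn.Module):
    def __init__(self, d_model=512, n_heads=8, kv_latent_dim=64):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # down-projection to the shared KV latent (what would be cached)
        self.kv_down = nn.Linear(d_model, kv_latent_dim, bias=False)
        # up-projections from the latent back to per-head keys and values
        self.k_up = nn.Linear(kv_latent_dim, d_model, bias=False)
        self.v_up = nn.Linear(kv_latent_dim, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                       # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q = self.q_proj(x)
        latent = self.kv_down(x)                # (b, t, kv_latent_dim)
        k = self.k_up(latent)
        v = self.v_up(latent)
        # split into heads: (b, n_heads, t, head_dim)
        q, k, v = (z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out)

# usage: out = SimplifiedMLA()(torch.randn(2, 16, 512))  # -> (2, 16, 512)
```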
Alternatives and similar repositories for mla-experiments:
Users who are interested in mla-experiments are comparing it to the libraries listed below.
- 🔥 A minimal training framework for scaling FLA models ☆107 · Updated last week
- Code for studying the super weight in LLM ☆98 · Updated 4 months ago
- Transformers components but in Triton ☆32 · Updated last month
- Implementation of the paper: "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆90 · Updated this week
- Efficient triton implementation of Native Sparse Attention. ☆136 · Updated 2 weeks ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆59 · Updated 2 months ago
- ☆125 · Updated last year
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆158 · Updated 10 months ago
- Fast and memory-efficient exact attention ☆67 · Updated last month
- ☆100 · Updated 10 months ago
- DPO, but faster ☆40 · Updated 4 months ago
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection ☆99 · Updated 2 months ago
- Implementation of Infini-Transformer in Pytorch ☆110 · Updated 3 months ago
- ☆48 · Updated last year
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" ☆97 · Updated 6 months ago
- Triton-based implementation of Sparse Mixture of Experts. ☆210 · Updated 4 months ago
- Linear Attention Sequence Parallelism (LASP) ☆82 · Updated 10 months ago
- ☆69 · Updated last month
- An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆35 · Updated 10 months ago
- Simple implementation of Speculative Sampling in NumPy for GPT-2. ☆93 · Updated last year
- MambaFormer in-context learning experiments and implementation for https://arxiv.org/abs/2402.04248 ☆51 · Updated 10 months ago
- Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance… ☆148 · Updated 2 weeks ago
- (Unofficial) PyTorch implementation of grouped-query attention (GQA) from "GQA: Training Generalized Multi-Query Transformer Models from … ☆162 · Updated 11 months ago
- The official implementation of the paper "Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques (TMLR)". ☆66 · Updated last month
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆72 · Updated 7 months ago
- ☆75 · Updated last week
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long) ☆57 · Updated 6 months ago
- ☆50 · Updated 5 months ago
- XAttention: Block Sparse Attention with Antidiagonal Scoring ☆140 · Updated 3 weeks ago
- ☆22 · Updated last year