ambisinister / mla-experiments
Experiments on Multi-Head Latent Attention
⭐92 · Updated 10 months ago
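For readers unfamiliar with the topic, Multi-Head Latent Attention (MLA, introduced with DeepSeek-V2) compresses keys and values into a shared low-rank latent so that only that small latent needs to be cached during decoding. The sketch below is a minimal illustration of the idea, not code from this repository: it omits the decoupled RoPE path and KV caching, and names such as `SimpleMLA` and `d_latent` are placeholders of our own.

```python
# Minimal, illustrative sketch of Multi-Head Latent Attention.
# Assumptions: no decoupled RoPE path, no KV cache; all names are ours.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMLA(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # Down-project hidden states to a small shared latent; at inference
        # only this latent would need to be cached per token.
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)
        # Up-project the latent back to per-head keys and values.
        self.k_up = nn.Linear(d_latent, d_model, bias=False)
        self.v_up = nn.Linear(d_latent, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, d = x.shape
        q = self.q_proj(x)
        c_kv = self.kv_down(x)          # (b, t, d_latent) compressed KV latent
        k = self.k_up(c_kv)
        v = self.v_up(c_kv)
        # reshape to (b, n_heads, t, d_head)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(o.transpose(1, 2).reshape(b, t, d))
```

The point of the low-rank path is that a decoder would cache only `c_kv` (width `d_latent`) per token instead of full per-head keys and values.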
Alternatives and similar repositories for mla-experiments
Users interested in mla-experiments are comparing it to the libraries listed below.
- 🔥 A minimal training framework for scaling FLA models (⭐178, updated 2 weeks ago)
- ⭐114, updated 3 weeks ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer (⭐131, updated last week; see the Muon sketch after this list)
- ⭐50, updated last year
- Fast and memory-efficient exact attention (⭐68, updated 3 months ago)
- Implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" (⭐95, updated 2 weeks ago)
- The evaluation framework for training-free sparse attention in LLMs (⭐69, updated last week)
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" (⭐163, updated last year; see the routing sketch after this list)
- Code for studying the super weight in LLMs (⭐107, updated 6 months ago)
- PyTorch implementation of the PEER block from the paper "Mixture of A Million Experts" by Xu Owen He at DeepMind (⭐127, updated 10 months ago)
- XAttention: Block Sparse Attention with Antidiagonal Scoring (⭐166, updated this week)
- Triton-based implementation of Sparse Mixture of Experts (⭐219, updated 6 months ago)
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" (⭐98, updated 8 months ago)
- The simplest implementation of recent sparse attention patterns for efficient LLM inference (⭐70, updated last week)
- The official implementation of the paper "Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques" (TMLR) (⭐71, updated 3 months ago)
- ⭐81, updated last year
- Implementation of Infini-Transformer in PyTorch (⭐111, updated 5 months ago)
- ⭐109, updated last year
- Tiled Flash Linear Attention library for fast and efficient mLSTM kernels (⭐57, updated last month)
- Transformers components but in Triton (⭐34, updated last month)
- Get down and dirty with FlashAttention 2.0 in PyTorch; plug in and play, no complex CUDA kernels (⭐106, updated last year)
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… (⭐137, updated 10 months ago)
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance (⭐136, updated this week)
- Linear Attention Sequence Parallelism (LASP) (⭐84, updated last year)
- Work in progress (⭐69, updated 2 weeks ago)
- An extension of the nanoGPT repository for training small MoE models (⭐152, updated 3 months ago)
- ⭐105, updated 10 months ago
- An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" (⭐35, updated last year)
- Low-bit optimizers for PyTorch (⭐129, updated last year)
- Layer-Condensed KV cache with 10x larger batch size, fewer params, and less computation; dramatic speedup with better task performance… (⭐149, updated 2 months ago)
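Several of the entries above reimplement Mixture-of-Depths. As a rough illustration of the routing idea they share, a linear router scores tokens, only the top fraction of tokens per sequence passes through the wrapped block, and the remaining tokens flow through the residual path unchanged. The sketch below is a simplification, not code from any listed repository: the sigmoid gating, the `capacity` parameter, and the class name are our own choices.

```python
# Illustrative top-k token routing in the spirit of Mixture-of-Depths.
# Simplified sketch; `capacity`, the sigmoid gate, and all names are ours.
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    def __init__(self, d_model, block: nn.Module, capacity=0.5):
        super().__init__()
        self.block = block            # any token-wise sub-block, e.g. an MLP
        self.router = nn.Linear(d_model, 1)
        self.capacity = capacity      # fraction of tokens processed per sequence

    def forward(self, x):             # x: (batch, seq, d_model)
        b, t, d = x.shape
        k = max(1, int(t * self.capacity))
        scores = self.router(x).squeeze(-1)          # (b, t) router scores
        topk = scores.topk(k, dim=-1).indices        # tokens that receive compute
        idx = topk.unsqueeze(-1).expand(-1, -1, d)   # (b, k, d) gather/scatter index
        selected = x.gather(1, idx)
        # Weight the block output by the (squashed) router score so the
        # routing decision stays differentiable.
        gate = torch.sigmoid(scores.gather(1, topk)).unsqueeze(-1)
        processed = selected + gate * self.block(selected)
        # Unselected tokens pass through the residual stream unchanged.
        return x.scatter(1, idx, processed)
```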
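Flash-Muon in the list refers to the Muon optimizer, which applies momentum to a 2-D weight's gradient and then approximately orthogonalizes the resulting update with a few Newton-Schulz iterations before taking the step. The sketch below is a reminder of that structure only: it uses the classical cubic Newton-Schulz iteration rather than the tuned quintic coefficients used in practice, and the function names and hyperparameters are illustrative, not taken from Flash-Muon.

```python
# Rough sketch of a Muon-style update for a single 2-D weight matrix.
# Uses the classical cubic Newton-Schulz iteration (X <- 1.5 X - 0.5 X X^T X)
# instead of the tuned quintic used in real implementations; names are ours.
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, ns_steps: int = 5) -> torch.Tensor:
    # Normalize so the spectral norm is below 1, which the iteration needs to converge.
    x = g / (g.norm() + 1e-7)
    for _ in range(ns_steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    momentum_buf.mul_(beta).add_(grad)                  # heavy-ball momentum
    update = newton_schulz_orthogonalize(momentum_buf)  # approximately orthogonal update
    weight.data.add_(update, alpha=-lr)
    return momentum_buf
```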