ambisinister / mla-experiments
Experiments on Multi-Head Latent Attention
☆91 · Updated 9 months ago
Alternatives and similar repositories for mla-experiments
Users interested in mla-experiments are comparing it to the libraries listed below.
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆121 · Updated last week
- PyTorch implementation of the PEER block from the paper "Mixture of A Million Experts" by Xu Owen He at DeepMind ☆126 · Updated 9 months ago
- Fast and memory-efficient exact attention ☆68 · Updated 3 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆62 · Updated 4 months ago
- 🔥 A minimal training framework for scaling FLA models ☆146 · Updated 3 weeks ago
- ☆93 · Updated last week
- Implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆94 · Updated this week
- Transformers components but in Triton ☆33 · Updated 3 weeks ago
- Low-bit optimizers for PyTorch ☆128 · Updated last year
- Code for studying the super weight in LLMs ☆104 · Updated 6 months ago
- Tiled Flash Linear Attention library for fast and efficient mLSTM kernels ☆56 · Updated 2 weeks ago
- Unofficial implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆161 · Updated 11 months ago
- Official implementation of "The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs" ☆32 · Updated last month
- Implementation of Infini-Transformer in PyTorch ☆111 · Updated 5 months ago
- ☆74 · Updated 3 months ago
- An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆35 · Updated 11 months ago
- An extension of the nanoGPT repository for training small MoE models. ☆147 · Updated 2 months ago
- ☆93 · Updated 2 weeks ago
- Linear Attention Sequence Parallelism (LASP) ☆83 · Updated last year
- ☆81 · Updated last year
- From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients. Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu, … ☆47 · Updated last month
- Efficient Triton implementation of Native Sparse Attention ☆155 · Updated last week
- Simple and efficient PyTorch-native transformer training and inference (batched) ☆75 · Updated last year
- ☆79 · Updated 9 months ago
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ☆134 · Updated 9 months ago
- [ICML 2025] Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length Generalization ☆70 · Updated this week
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" ☆97 · Updated 8 months ago
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ☆234 · Updated 3 months ago
- [ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate" ☆104 · Updated 3 weeks ago
- Triton-based implementation of Sparse Mixture of Experts. ☆217 · Updated 6 months ago