ambisinister / mla-experiments
Experiments on Multi-Head Latent Attention
☆89 · Updated 8 months ago
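For context, a minimal sketch of the latent-KV compression idea behind Multi-Head Latent Attention (as introduced with DeepSeek-V2), assuming PyTorch. The module name, dimensions, and the omission of the decoupled RoPE path are illustrative simplifications, not the code in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    """Toy MLA-style attention: keys/values are reconstructed from a small shared latent."""
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # Hidden states are down-projected to a small shared latent c_kv;
        # only this latent would need to be cached at inference time.
        self.kv_down = nn.Linear(d_model, d_latent)
        # Per-head keys and values are up-projected from the latent.
        self.k_up = nn.Linear(d_latent, d_model)
        self.v_up = nn.Linear(d_latent, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        c_kv = self.kv_down(x)                  # (B, T, d_latent): the compressed KV "cache"
        k = self.k_up(c_kv).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(c_kv).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(o.transpose(1, 2).reshape(B, T, -1))

x = torch.randn(2, 16, 512)
print(LatentKVAttention()(x).shape)             # torch.Size([2, 16, 512])
```

The point of the low-rank latent is that the per-token cache shrinks from 2 · d_model values to d_latent values, at the cost of the extra up-projections.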
Alternatives and similar repositories for mla-experiments
Users interested in mla-experiments are comparing it to the libraries listed below.
- 🔥 A minimal training framework for scaling FLA models ☆128 · Updated last week
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆100 · Updated this week
- ☆128 · Updated 2 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆73 · Updated 8 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆60 · Updated 3 months ago
- Code for studying the super weight in LLM ☆100 · Updated 5 months ago
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆158 · Updated 10 months ago
- Fast and memory-efficient exact attention ☆68 · Updated 2 months ago
- An extension of the nanoGPT repository for training small MoE models. ☆140 · Updated 2 months ago
- Efficient Triton implementation of Native Sparse Attention. ☆144 · Updated last month
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection ☆108 · Updated 2 months ago
- ☆81 · Updated last year
- Low-bit optimizers for PyTorch ☆128 · Updated last year
- ☆146 · Updated last year
- An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆35 · Updated 11 months ago
- ☆71 · Updated 2 months ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆154 · Updated last month
- ☆103 · Updated 11 months ago
- EE-LLM is a framework for large-scale training and inference of early-exit (EE) large language models (LLMs). ☆61 · Updated 11 months ago
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ☆66 · Updated 6 months ago
- XAttention: Block Sparse Attention with Antidiagonal Scoring ☆146 · Updated last month
- The official repository of Quamba1 [ICLR 2025 🔥] & Quamba2 [ICML 2025 🔥] ☆45 · Updated last month
- [ICLR 2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM. ☆71 · Updated 4 months ago
- Load compute kernels from the Hub ☆116 · Updated this week
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆70 · Updated 11 months ago
- Transformers components but in Triton ☆33 · Updated this week
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆161 · Updated 10 months ago
- Implementation of the paper: "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆91 · Updated this week
- Official implementation of Phi-Mamba. A MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Mode…) ☆106 · Updated 8 months ago
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆185 · Updated 11 months ago