kyegomez / MultiQueryAttentionLinks

This is a simple torch implementation of the high performance Multi-Query Attention

☆15

Alternatives and similar repositories for MultiQueryAttention

Users that are interested in MultiQueryAttention are comparing it to the libraries listed below

Sorting:

TRI-ML / linear_open_lm
A repository for research on medium sized language models.
☆78Updated last year
recursal / GoldFinch-paper
GoldFinch and other hybrid transformer components
☆45Updated last year
piotrpiekos / MoSA
User-friendly implementation of the Mixture-of-Sparse-Attention (MoSA). MoSA selects distinct tokens for each head with expert choice rou…
☆28Updated 6 months ago
Qichuzyy / POA
Official implementation of ECCV24 paper: POA
☆24Updated last year
gmongaras / Cottention_Transformer
Code for the paper "Cottention: Linear Transformers With Cosine Attention"
☆20Updated this week
RobertCsordas / moe_attention
Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention"
☆101Updated last year
OpenNLPLab / HGRN2
HGRN2: Gated Linear RNNs with State Expansion
☆55Updated last year
vmarinowski / infini-attention
An unofficial pytorch implementation of 'Efficient Infinite Context Transformers with Infini-attention'
☆54Updated last year
giangdip2410 / HyperRouter
Code for this paper "HyperRouter: Towards Efficient Training and Inference of Sparse Mixture of Experts via HyperNetwork"
☆33Updated last year
kyegomez / Infini-attention
Implementation of the paper: "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention" from Google in pyTO…
☆56Updated last month
xuhaoxh / infini-gram-mini
☆30Updated last month
RWKV / RWKV-LM
RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best…
☆53Updated 8 months ago
dangxingyu / rnn-icrag
Official repository of paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval"
☆27Updated last year
frankxwang / dpo-prefix-sharing
DPO, but faster 🚀
☆46Updated 11 months ago
chijames / KERPLE
☆19Updated 3 years ago
thunlp / SparsingLaw
The open-source materials for paper "Sparsing Law: Towards Large Language Models with Greater Activation Sparsity".
☆27Updated last year
sail-sg / SkyLadder
The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling
☆40Updated last month
RobertCsordas / moeut
☆88Updated last year
graphcore-research / jax-scalify
JAX Scalify: end-to-end scaled arithmetics
☆17Updated last year
VITA-Group / WeLore
From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients. Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu,…
☆51Updated 3 weeks ago
kiddyboots216 / lottery-ticket-adaptation
Lottery Ticket Adaptation
☆40Updated last year
tianyi-lab / C3PO
[COLM 2025] "C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing"
☆18Updated 7 months ago
RWKV / ZeroCoT
https://x.com/BlinkDL_AI/status/1884768989743882276
☆28Updated 6 months ago
yale-nlp / refdpo
☆16Updated last year
smonsays / hypernetwork-attention
Official code for the paper "Attention as a Hypernetwork"
☆46Updated last year
LCM-Lab / LOGO
Code for paper: Long cOntext aliGnment via efficient preference Optimization
☆23Updated last month
BlinkDL / LinearAttentionArena
Here we will test various linear attention designs.
☆61Updated last year
RobertCsordas / moe
Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers"
☆38Updated 5 months ago
kyegomez / TTL
Pytorch Implementation of the paper: "Learning to (Learn at Test Time): RNNs with Expressive Hidden States"
☆25Updated this week
lucidrains / infini-transformer-pytorch
Implementation of Infini-Transformer in Pytorch
☆113Updated 10 months ago