fkodom / grouped-query-attention-pytorch

(Unofficial) PyTorch implementation of grouped-query attention (GQA) from "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (https://arxiv.org/pdf/2305.13245.pdf)

☆181

Alternatives and similar repositories for grouped-query-attention-pytorch

Users who are interested in grouped-query-attention-pytorch are comparing it to the libraries listed below.

OpenNLPLab / lightning-attention
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
☆330 Updated 8 months ago
thu-ml / low-bit-optimizers
Low-bit optimizers for PyTorch
☆132 Updated 2 years ago
transformer-vq / transformer_vq
☆197 Updated last year
yxli2123 / LoftQ
☆231 Updated last year
OpenNLPLab / TransnormerLLM
Official implementation of TransNormerLLM: A Faster and Better LLM
☆247 Updated last year
kyegomez / AttentionIsOFFByOne
Implementation of "Attention Is Off By One" by Evan Miller
☆196 Updated 2 years ago
astramind-ai / Mixture-of-depths
Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
☆174 Updated last year
bzhangGo / rmsnorm
Root Mean Square Layer Normalization
☆256 Updated 2 years ago
QingruZhang / AdaLoRA
AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning (ICLR 2023).
☆353 Updated 2 years ago
bojone / rerope
Rectified Rotary Position Embeddings
☆381 Updated last year
Outsider565 / LoRA-GA
☆213 Updated last year
fxmeng / TransMLA
TransMLA: Multi-Head Latent Attention Is All You Need (NeurIPS 2025 Spotlight)
☆389 Updated last month
nbasyl / DoRA
Official implementation of "DoRA: Weight-Decomposed Low-Rank Adaptation"
☆124 Updated last year
ambisinister / mla-experiments
Experiments on Multi-Head Latent Attention
☆97 Updated last year
Cohere-Labs-Community / parameter-efficient-moe
☆271 Updated last year
nengwp / Lion-vs-Adam
Lion and Adam optimization comparison
☆64 Updated 2 years ago
kyegomez / Mixture-of-Depths
Implementation of the paper: "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
☆108 Updated last week
shreyansh26 / Speculative-Sampling
Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind
☆104 Updated last year
timinar / BabyLlama
Training code for Baby-Llama, our submission to the strict-small track of the BabyLM challenge.
☆84 Updated 2 years ago
thu-ml / ReMoE
[ICLR2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM.
☆97 Updated 10 months ago
kyegomez / FlashAttention20
Get down and dirty with FlashAttention 2.0 in PyTorch: plug and play, no complex CUDA kernels
☆108 Updated 2 years ago
YuchuanTian / DiJiang
[ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear at…
☆104 Updated last year
SimiaoZuo / MoEBERT
This PyTorch package implements MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation (NAACL 2022).
☆112 Updated 3 years ago
pprp / Awesome-Efficient-MoE
Efficient Mixture of Experts for LLM Paper List
☆140 Updated last month
thunlp / MoEfication
☆140 Updated last year
DRSY / EMO
[ICLR 2024] EMO: Earth Mover Distance Optimization for Auto-Regressive Language Modeling (https://arxiv.org/abs/2310.04691)
☆126 Updated last year
facebookresearch / LLM-QAT
Code repo for the paper "LLM-QAT Data-Free Quantization Aware Training for Large Language Models"
☆317 Updated 7 months ago
dilab-zju / self-speculative-decoding
Code associated with the paper "Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding"
☆205 Updated 8 months ago
jongwooko / distillm
Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024)
☆234 Updated 7 months ago
thu-coai / MiniPLM
[ICLR 2025] MiniPLM: Knowledge Distillation for Pre-Training Language Models
☆61 Updated 11 months ago