fkodom / grouped-query-attention-pytorch
(Unofficial) PyTorch implementation of grouped-query attention (GQA) from "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (https://arxiv.org/pdf/2305.13245.pdf)
☆166 · Updated last year
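For reference, below is a minimal sketch of the grouped-query attention mechanism from the paper, not the code in this repository: query heads are split into groups, and each group shares a single key/value head. The class and argument names (`GroupedQueryAttention`, `num_query_heads`, `num_kv_heads`) are illustrative assumptions rather than the repo's actual API.

```python
# Minimal grouped-query attention (GQA) sketch; names are illustrative, not the repo's API.
import torch
import torch.nn.functional as F
from torch import nn


class GroupedQueryAttention(nn.Module):
    def __init__(self, embed_dim: int, num_query_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_query_heads % num_kv_heads == 0, "query heads must divide evenly into KV groups"
        self.head_dim = embed_dim // num_query_heads
        self.num_query_heads = num_query_heads
        self.num_kv_heads = num_kv_heads
        # Fewer key/value heads than query heads is the whole point of GQA.
        self.q_proj = nn.Linear(embed_dim, num_query_heads * self.head_dim)
        self.k_proj = nn.Linear(embed_dim, num_kv_heads * self.head_dim)
        self.v_proj = nn.Linear(embed_dim, num_kv_heads * self.head_dim)
        self.out_proj = nn.Linear(num_query_heads * self.head_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # (batch, heads, seq, head_dim)
        q = self.q_proj(x).view(b, t, self.num_query_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of query heads shares one KV head: repeat the KV heads so
        # their count matches the query heads, then run standard attention.
        groups = self.num_query_heads // self.num_kv_heads
        k = k.repeat_interleave(groups, dim=1)
        v = v.repeat_interleave(groups, dim=1)
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out)


if __name__ == "__main__":
    attn = GroupedQueryAttention(embed_dim=512, num_query_heads=8, num_kv_heads=2)
    y = attn(torch.randn(2, 16, 512))
    print(y.shape)  # torch.Size([2, 16, 512])
```

With `num_kv_heads == num_query_heads` this reduces to standard multi-head attention, and with `num_kv_heads == 1` it reduces to multi-query attention; GQA interpolates between the two to shrink the KV cache with little quality loss.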
Alternatives and similar repositories for grouped-query-attention-pytorch
Users interested in grouped-query-attention-pytorch are comparing it to the libraries listed below.
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆294 · Updated 3 months ago
- Implementation of the paper: "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆94 · Updated this week
- ☆198 · Updated 7 months ago
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆161 · Updated 11 months ago
- Official implementation of "DoRA: Weight-Decomposed Low-Rank Adaptation" ☆123 · Updated last year
- Code for paper "Patch-Level Training for Large Language Models" ☆86 · Updated 6 months ago
- ☆191 · Updated last year
- Efficient Mixture of Experts for LLM Paper List ☆68 · Updated 5 months ago
- Official implementation of TransNormerLLM: A Faster and Better LLM ☆243 · Updated last year
- ☆222 · Updated 11 months ago
- [ICML'24] The official implementation of “Rethinking Optimization and Architecture for Tiny Language Models” ☆121 · Updated 4 months ago
- 🔥 A minimal training framework for scaling FLA models ☆146 · Updated 3 weeks ago
- Root Mean Square Layer Normalization ☆241 · Updated 2 years ago
- ☆258 · Updated last year
- Inference Code for Paper "Harder Tasks Need More Experts: Dynamic Routing in MoE Models" ☆50 · Updated 10 months ago
- Rectified Rotary Position Embeddings ☆370 · Updated last year
- Low-bit optimizers for PyTorch ☆128 · Updated last year
- ☆103 · Updated last year
- TransMLA: Multi-Head Latent Attention Is All You Need ☆284 · Updated this week
- Implementation of Soft MoE, proposed by Brain's Vision team, in PyTorch ☆294 · Updated 2 months ago
- [AAAI 2024] Fluctuation-based Adaptive Structured Pruning for Large Language Models ☆51 · Updated last year
- qwen-nsa ☆66 · Updated last month
- [ACL 2024] Long-Context Language Modeling with Parallel Encodings ☆153 · Updated 11 months ago
- [ICLR2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM. ☆74 · Updated 5 months ago
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training ☆203 · Updated 2 weeks ago
- ☆149 · Updated last year
- Implementation of FlashAttention in PyTorch ☆150 · Updated 4 months ago
- Efficient Triton implementation of Native Sparse Attention ☆155 · Updated last week
- DeepSeek Native Sparse Attention PyTorch implementation ☆70 · Updated 3 months ago
- PyTorch codes for "LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning" ☆238 · Updated 2 years ago