OpenMachine-ai / transformer-tricksLinks

A collection of tricks and tools to speed up transformer models

☆169

Alternatives and similar repositories for transformer-tricks

Users that are interested in transformer-tricks are comparing it to the libraries listed below

Sorting:

kyleliang919 / Super_Muon
☆60Updated 4 months ago
JT-Ushio / MHA2MLA
Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
☆183Updated last month
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆149Updated last month
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆385Updated last month
HanGuo97 / log-linear-attention
☆232Updated 2 months ago
NVlabs / GatedDeltaNet
[ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule
☆193Updated 4 months ago
BlinkDL / modded-nanogpt-rwkv
RWKV-7: Surpassing GPT
☆94Updated 8 months ago
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆86Updated last month
Zyphra / tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
☆127Updated 8 months ago
wdlctc / headinfer
☆54Updated 2 months ago
Dao-AILab / grouped-latent-attention
☆123Updated 2 months ago
HazyResearch / lolcats
Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models"
☆244Updated 6 months ago
Jellyfish042 / Sudoku-RWKV
☆144Updated 8 months ago
sustcsonglin / linear-attention-and-beyond-slides
☆79Updated 5 months ago
IST-DASLab / QuEST
Work in progress.
☆70Updated last month
BorealisAI / neuzip
Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks". This rep…
☆59Updated 9 months ago
VITA-Group / WeLore
From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients. Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu,…
☆47Updated 3 months ago
itsnamgyu / block-transformer
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
☆160Updated 3 months ago
tridao / flash-attention-wheels
☆52Updated last year
OpenGVLab / EfficientQAT
[ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
☆288Updated 2 months ago
astramind-ai / BitMat
An efficent implementation of the method proposed in "The Era of 1-bit LLMs"
☆154Updated 9 months ago
QwenLM / ParScale
Parallel Scaling Law for Language Model — Beyond Parameter and Inference Time Scaling
☆428Updated 2 months ago
Cornell-RelaxML / qtip
☆145Updated last month
OpenNLPLab / lightning-attention
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
☆323Updated 5 months ago
NX-AI / flashrnn
FlashRNN - Fast RNN Kernels with I/O Awareness
☆93Updated last month
NVIDIA / ngpt
Normalized Transformer (nGPT)
☆185Updated 8 months ago
sebulo / LoQT
☆80Updated 8 months ago
Zyphra / Zamba2
PyTorch implementation of models from the Zamba2 series.
☆184Updated 6 months ago
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆134Updated 6 months ago
DeepAuto-AI / hip-attention
Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton.
☆141Updated this week