OpenMachine-ai / transformer-tricks
A collection of tricks and tools to speed up transformer models
☆157 · Updated last month
Alternatives and similar repositories for transformer-tricks:
Users interested in transformer-tricks are comparing it to the libraries listed below.
- Efficient LLM Inference over Long Sequences · ☆372 · Updated last week
- 🔥 A minimal training framework for scaling FLA models · ☆119 · Updated this week
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) · ☆153 · Updated 3 weeks ago
- ☆54 · Updated last month
- RWKV-7: Surpassing GPT · ☆84 · Updated 5 months ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs · ☆163 · Updated this week
- Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks". This rep… · ☆58 · Updated 6 months ago
- ☆132 · Updated 5 months ago
- PyTorch implementation of models from the Zamba2 series · ☆180 · Updated 3 months ago
- Efficient Triton implementation of Native Sparse Attention · ☆142 · Updated 3 weeks ago
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule · ☆160 · Updated last month
- ☆131 · Updated last month
- [NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models · ☆215 · Updated this week
- ☆69 · Updated 2 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters · ☆126 · Updated 5 months ago
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models · ☆132 · Updated 10 months ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" · ☆231 · Updated 3 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer · ☆91 · Updated this week
- Load compute kernels from the Hub · ☆115 · Updated 2 weeks ago
- Work in progress · ☆58 · Updated last month
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference · ☆60 · Updated 3 months ago
- XAttention: Block Sparse Attention with Antidiagonal Scoring · ☆142 · Updated last month
- ☆48 · Updated last year
- ☆280 · Updated 2 weeks ago
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models · ☆284 · Updated 2 months ago
- KV cache compression for high-throughput LLM inference · ☆124 · Updated 3 months ago
- ☆45 · Updated last week
- Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton. · ☆129 · Updated this week
- Linear Attention Sequence Parallelism (LASP) · ☆82 · Updated 11 months ago
- Unofficial implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" · ☆158 · Updated 10 months ago