OpenMachine-ai / transformer-tricks
A collection of tricks and tools to speed up transformer models
☆167 · Updated 3 weeks ago
Alternatives and similar repositories for transformer-tricks
Users interested in transformer-tricks are comparing it to the libraries listed below.
- ☆216 · Updated 2 weeks ago
- Parallel Scaling Law for Language Models — Beyond Parameter and Inference Time Scaling ☆395 · Updated last month
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆176 · Updated this week
- ☆114 · Updated 3 weeks ago
- Efficient LLM Inference over Long Sequences ☆378 · Updated 3 weeks ago
- ☆56 · Updated 3 months ago
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule ☆174 · Updated 3 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆126 · Updated 6 months ago
- 🔥 A minimal training framework for scaling FLA models ☆178 · Updated 2 weeks ago
- RWKV-7: Surpassing GPT ☆91 · Updated 7 months ago
- Efficient Triton implementation of Native Sparse Attention ☆168 · Updated last month
- Chain of Experts (CoE) enables communication between experts within Mixture-of-Experts (MoE) models ☆170 · Updated this week
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques… ☆151 · Updated this week
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆273 · Updated last month
- Reverse Engineering Gemma 3n: Google's New Edge-Optimized Language Model ☆127 · Updated 3 weeks ago
- Code for the paper [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆113 · Updated last month
- Load compute kernels from the Hub ☆191 · Updated this week
- The evaluation framework for training-free sparse attention in LLMs ☆69 · Updated last week
- PyTorch implementation of models from the Zamba2 series ☆182 · Updated 5 months ago
- ☆50 · Updated last month
- Work in progress. ☆69 · Updated 2 weeks ago
- XAttention: Block Sparse Attention with Antidiagonal Scoring ☆166 · Updated this week
- KV cache compression for high-throughput LLM inference ☆131 · Updated 4 months ago
- Normalized Transformer (nGPT) ☆184 · Updated 7 months ago
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆313 · Updated 4 months ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆239 · Updated 4 months ago
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆163 · Updated last year
- [ICLR 2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM ☆81 · Updated 6 months ago
- Layer-Condensed KV cache with 10x larger batch size, fewer params, and less computation. Dramatic speedup with better task performance… ☆149 · Updated 2 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆131 · Updated last week