OpenMachine-ai / transformer-tricks
A collection of tricks and tools to speed up transformer models
⭐ 150 · Updated last week
Alternatives and similar repositories for transformer-tricks:
Users interested in transformer-tricks are comparing it to the libraries listed below.
- 🔥 A minimal training framework for scaling FLA models · ⭐ 94 · Updated last week
- Efficient LLM Inference over Long Sequences · ⭐ 366 · Updated last month
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs · ⭐ 158 · Updated this week
- ⭐ 67 · Updated last month
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters · ⭐ 125 · Updated 4 months ago
- Efficient Triton implementation of Native Sparse Attention · ⭐ 134 · Updated last week
- Normalized Transformer (nGPT) · ⭐ 166 · Updated 4 months ago
- Load compute kernels from the Hub · ⭐ 113 · Updated this week
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models · ⭐ 277 · Updated last month
- [ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule · ⭐ 147 · Updated 3 weeks ago
- nanoGRPO is a lightweight implementation of Group Relative Policy Optimization (GRPO); see the GRPO sketch after this list · ⭐ 96 · Updated last week
- Fast and memory-efficient exact attention · ⭐ 67 · Updated last month
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) · ⭐ 151 · Updated 3 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference · ⭐ 59 · Updated 2 months ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs (see the memory-layer sketch after this list). Conceptually, spars… · ⭐ 313 · Updated 3 months ago
- XAttention: Block Sparse Attention with Antidiagonal Scoring · ⭐ 134 · Updated last week
- ⭐ 41 · Updated last week
- When it comes to optimizers, it's always better to be safe than sorry · ⭐ 216 · Updated last week
- PyTorch implementation of models from the Zamba2 series · ⭐ 178 · Updated 2 months ago
- ⭐ 50 · Updated 2 weeks ago
- Chain of Experts (CoE) enables communication between experts within Mixture-of-Experts (MoE) models · ⭐ 154 · Updated 3 weeks ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM · ⭐ 158 · Updated 8 months ago
- Code accompanying the paper "Generalized Interpolating Discrete Diffusion" · ⭐ 70 · Updated 3 weeks ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" · ⭐ 229 · Updated 2 months ago
- Low-bit optimizers for PyTorch · ⭐ 126 · Updated last year
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients · ⭐ 195 · Updated 8 months ago
- RWKV-7: Surpassing GPT · ⭐ 82 · Updated 4 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs · ⭐ 236 · Updated last month
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models · ⭐ 130 · Updated 9 months ago
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI; see the hypersphere sketch after this list · ⭐ 277 · Updated 3 weeks ago
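
For readers skimming the nanoGRPO entry above: the core idea of Group Relative Policy Optimization is a critic-free, group-relative advantage. Several completions are sampled per prompt, and each completion's reward is normalized by the mean and standard deviation of its own group. The snippet below is a generic sketch of that normalization only; the tensor shapes and `eps` value are illustrative and not taken from the listed repository.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages.

    rewards: (num_prompts, group_size) tensor, one row of sampled-completion
    rewards per prompt. Normalizing within each group removes the need for a
    learned value function (critic).
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each.
rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0],
                        [0.2, 0.9, 0.4, 0.3]])
print(group_relative_advantages(rewards))
```

These advantages then weight a clipped policy-ratio objective, analogous to PPO but without a value network.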
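
The memory-layers entry above describes a trainable key-value lookup that adds parameters without a matching increase in FLOPs. The sketch below shows the basic shape of such a layer under simplified assumptions: a projected query is scored against a learned key table and only the top-k values are mixed back into the hidden state. All module and dimension names are hypothetical; real memory layers use product-key lookup so the scoring step itself avoids touching every slot.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryLayerSketch(nn.Module):
    """Toy trainable key-value lookup: the key/value tables add parameters,
    while only top-k entries are gathered per token. Illustrative only."""

    def __init__(self, d_model: int, num_slots: int = 4096, top_k: int = 4):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, d_model) * 0.02)
        self.values = nn.Parameter(torch.randn(num_slots, d_model) * 0.02)
        self.query_proj = nn.Linear(d_model, d_model)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, S, D)
        q = self.query_proj(x)                            # (B, S, D)
        # NOTE: dense scoring touches every slot; product-key lookup
        # avoids this in practical memory layers.
        scores = q @ self.keys.t()                        # (B, S, num_slots)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)           # (B, S, k)
        gathered = self.values[top_idx]                   # (B, S, k, D)
        return x + (weights.unsqueeze(-1) * gathered).sum(dim=-2)

mem = MemoryLayerSketch(d_model=64)
print(mem(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```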
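
Two entries above (Normalized Transformer (nGPT) and the nGPT reimplementation) refer to learning on the hypersphere: hidden states and weight rows are kept at unit L2 norm, and each block updates the state by a learned interpolation followed by re-normalization. Below is a minimal sketch of that idea under stated assumptions; the single-projection block and the `alpha` step parameter are illustrative and much simpler than either repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def to_sphere(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Project vectors onto the unit hypersphere (L2-normalize)."""
    return F.normalize(x, dim=dim)

class HypersphereBlockSketch(nn.Module):
    """Toy block in the spirit of nGPT: unit-norm weight rows, unit-norm
    hidden states, and a learned interpolation step. Illustrative only."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.alpha = nn.Parameter(torch.full((d_model,), 0.05))  # per-dim step size

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        w = to_sphere(self.proj.weight, dim=-1)        # unit-norm weight rows
        update = to_sphere(h @ w.t())                  # unit-norm block output
        return to_sphere(h + self.alpha * (update - h))  # interpolate, re-project

block = HypersphereBlockSketch(d_model=64)
h = to_sphere(torch.randn(2, 10, 64))
print(block(h).norm(dim=-1)[0, :3])  # ~1.0 for every token
```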