thu-ml / low-bit-optimizersLinks

Low-bit optimizers for PyTorch

☆130

Alternatives and similar repositories for low-bit-optimizers

Users that are interested in low-bit-optimizers are comparing it to the libraries listed below

Sorting:

Dao-AILab / grouped-latent-attention
☆123Updated 2 months ago
haochengxi / Train_Transformers_with_INT4
☆153Updated 2 years ago
yxli2123 / LoftQ
☆223Updated last year
YuchuanTian / DiJiang
[ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear at…
☆102Updated last year
OpenNLPLab / lightning-attention
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
☆323Updated 5 months ago
fla-org / flame
🔥 A minimal training framework for scaling FLA models
☆220Updated last month
nbasyl / DoRA
Official implementation of "DoRA: Weight-Decomposed Low-Rank Adaptation"
☆124Updated last year
FasterDecoding / TEAL
☆137Updated 5 months ago
NVlabs / COAT
[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training
☆221Updated last month
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆165Updated last year
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆149Updated last month
nbasyl / LLM-FP4
The official implementation of the EMNLP 2023 paper LLM-FP4
☆211Updated last year
XunhaoLai / native-sparse-attention-triton
Efficient triton implementation of Native Sparse Attention.
☆186Updated 2 months ago
FasterDecoding / BitDelta
☆199Updated 8 months ago
berlino / gated_linear_attention
☆106Updated last year
kyegomez / FlashAttention20
Get down and dirty with FlashAttention2.0 in pytorch, plug in and play no complex CUDA kernels
☆106Updated 2 years ago
alexzhang13 / flashattention2-custom-mask
Triton implementation of FlashAttention2 that adds Custom Masks.
☆128Updated 11 months ago
YuchuanTian / RethinkTinyLM
[ICML'24] The official implementation of “Rethinking Optimization and Architecture for Tiny Language Models”
☆122Updated 6 months ago
chu-tianxiang / QuIP-for-all
QuIP quantization
☆54Updated last year
epfml / dynamic-sparse-flash-attention
☆147Updated 2 years ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆72Updated last year
teelinsan / parallel-decoding
Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding"
☆119Updated last year
rayleizhu / vllm-ra
[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
☆40Updated last year
astramind-ai / Mixture-of-depths
Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
☆167Updated last year
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆230Updated 8 months ago
OpenNLPLab / LASP
Linear Attention Sequence Parallelism (LASP)
☆85Updated last year
SalesforceAIResearch / GemFilter
☆83Updated 6 months ago
pprp / Pruner-Zero
[ICML24] Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs
☆91Updated 8 months ago
thunlp / Ouroboros
Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main)
☆109Updated 4 months ago
RulinShao / LightSeq
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆212Updated 11 months ago