thu-ml / low-bit-optimizers
Low-bit optimizers for PyTorch
☆119 · Updated last year
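To make the repo's purpose concrete, here is a minimal sketch of the core trick behind low-bit optimizer states: storing Adam's moment tensors in 4 bits with per-block scales instead of FP32. The block size, the symmetric absmax quantizer, and the function names below are illustrative assumptions, not this repo's actual API.

```python
# Minimal sketch (not this repo's API): 4-bit blockwise quantization of an
# optimizer moment tensor. Block size and quantization map are illustrative.
import torch

def quantize_4bit(x: torch.Tensor, block: int = 128):
    """Quantize a flat FP32 tensor to signed 4-bit codes with per-block absmax scales."""
    x = x.reshape(-1, block)
    scale = x.abs().amax(dim=1, keepdim=True).clamp_min(1e-8)       # one scale per block
    codes = torch.round(x / scale * 7).clamp(-8, 7).to(torch.int8)  # 4-bit range [-8, 7]
    return codes, scale  # real implementations pack two codes per byte

def dequantize_4bit(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return codes.float() / 7 * scale

# Round-trip an Adam first-moment tensor through 4-bit storage.
m = torch.randn(1024) * 1e-3
codes, scale = quantize_4bit(m)
m_hat = dequantize_4bit(codes, scale).reshape(-1)
print(f"max abs error: {(m - m_hat).abs().max():.2e}")
```

An optimizer built on this idea dequantizes a block, applies the update step, and re-quantizes the result, trading a small amount of update noise for roughly 8x smaller optimizer state than FP32.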
Related projects
Alternatives and complementary repositories for low-bit-optimizers
- ☆134 · Updated last year
- ☆199 · Updated 5 months ago
- An algorithm for static activation quantization of LLMs ☆77 · Updated last week
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆147 · Updated 4 months ago
- ☆96 · Updated last month
- Official implementation of "DoRA: Weight-Decomposed Low-Rank Adaptation" ☆123 · Updated 6 months ago
- Triton-based implementation of Sparse Mixture of Experts. ☆185 · Updated last month
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆167 · Updated 11 months ago
- ☆188 · Updated 6 months ago
- ☆122 · Updated 9 months ago
- Implementation of Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting ☆44 · Updated 4 months ago
- Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024) ☆138 · Updated 2 months ago
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ☆49 · Updated 3 weeks ago
- Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers ☆195 · Updated 3 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5) ☆208 · Updated 3 weeks ago
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache (see the asymmetric-quantizer sketch after this list) ☆241 · Updated last month
- [ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear at… ☆98 · Updated 5 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆57 · Updated 5 months ago
- [ICML'24] The official implementation of “Rethinking Optimization and Architecture for Tiny Language Models” ☆118 · Updated 4 months ago
- ☆98 · Updated 8 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆51 · Updated 2 months ago
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆184 · Updated 6 months ago
- ☆132 · Updated last year
- PB-LLM: Partially Binarized Large Language Models ☆148 · Updated last year
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆134 · Updated 5 months ago
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind (see the acceptance-rule sketch after this list) ☆81 · Updated 8 months ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (Official Code) ☆135 · Updated last month
- ☆154 · Updated last month
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" ☆92 · Updated last month
- Explorations into some recent techniques surrounding speculative decoding ☆211 · Updated last year
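For the KIVI entry above, a minimal sketch of the per-group asymmetric quantizer that tuning-free 2-bit KV-cache schemes are built on. The group size is an assumption, and KIVI's actual insight of quantizing keys per-channel and values per-token (plus keeping a small full-precision residual window) is elided here.

```python
# Minimal sketch (assumed group size, not KIVI's actual code): asymmetric
# 2-bit quantization maps each group to 4 levels between its min and max.
import torch

def quant_2bit_asym(x: torch.Tensor, group: int = 32):
    x = x.reshape(-1, group)
    lo = x.amin(dim=1, keepdim=True)
    hi = x.amax(dim=1, keepdim=True)
    scale = (hi - lo).clamp_min(1e-8) / 3            # 2 bits -> 4 levels: 0..3
    codes = torch.round((x - lo) / scale).clamp(0, 3).to(torch.uint8)
    return codes, scale, lo                          # zero-point = lo (asymmetric)

def dequant_2bit_asym(codes, scale, lo):
    return codes.float() * scale + lo

k = torch.randn(4, 256)                              # e.g., a slice of cached keys
k_hat = dequant_2bit_asym(*quant_2bit_asym(k)).reshape_as(k)
```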
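And for the Speculative Sampling entry, a sketch of the acceptance rule from the DeepMind paper: a draft token x is accepted with probability min(1, p(x)/q(x)), where p and q are the target and draft next-token distributions; on rejection, a replacement token is drawn from the normalized residual max(0, p - q). Running the two models is elided.

```python
# Sketch of the speculative-sampling accept/reject step (model calls elided).
# Accepted-or-resampled tokens are distributed exactly according to p.
import torch

def accept_or_resample(p: torch.Tensor, q: torch.Tensor, draft_token: int) -> int:
    """p, q: target and draft next-token distributions over the vocabulary."""
    if torch.rand(()) < (p[draft_token] / q[draft_token]).clamp(max=1.0):
        return draft_token                       # accept with prob. min(1, p/q)
    residual = (p - q).clamp_min(0.0)            # on rejection: sample from
    residual = residual / residual.sum()         # the normalized residual
    return int(torch.multinomial(residual, 1))
```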