microsoft / VPTQ
VPTQ, a flexible and extreme low-bit quantization algorithm
☆529 · Updated this week
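For readers unfamiliar with the approach, the sketch below illustrates the general idea behind codebook-based (vector) weight quantization that extreme low-bit methods such as VPTQ build on: weight entries are grouped into short vectors, and each vector is stored as a small index into a learned codebook. This is a minimal conceptual sketch, not the repository's actual algorithm or API; the function name, vector dimension, codebook size, and plain k-means fitting are all illustrative assumptions.

```python
# Conceptual sketch of codebook-based (vector) weight quantization.
# Not VPTQ's actual algorithm or API; names and hyperparameters are illustrative.
import torch

def quantize_weights_with_codebook(weight: torch.Tensor, vector_dim: int = 8,
                                   num_centroids: int = 256, iters: int = 10):
    """Group weight entries into short vectors and replace each vector with
    the nearest centroid of a small codebook fitted by plain k-means."""
    flat = weight.reshape(-1, vector_dim)                 # (N, d) weight vectors
    # Initialize centroids from randomly chosen weight vectors.
    idx = torch.randperm(flat.shape[0])[:num_centroids]
    codebook = flat[idx].clone()                          # (K, d)
    for _ in range(iters):
        # Assign each weight vector to its nearest centroid.
        dists = torch.cdist(flat, codebook)               # (N, K)
        assign = dists.argmin(dim=1)                      # (N,)
        # Update each centroid as the mean of its assigned vectors.
        for k in range(num_centroids):
            members = flat[assign == k]
            if members.numel() > 0:
                codebook[k] = members.mean(dim=0)
    indices = torch.cdist(flat, codebook).argmin(dim=1)   # low-bit index stored per vector
    dequant = codebook[indices].reshape(weight.shape)     # reconstruction used at inference
    return codebook, indices, dequant

# Example: 256 centroids over 8-dim vectors -> one 8-bit index per 8 weights.
w = torch.randn(512, 512)
codebook, indices, w_hat = quantize_weights_with_codebook(w)
print(indices.shape, w_hat.shape)
```

With 256 centroids over 8-dimensional vectors, each vector is stored as one 8-bit index, i.e. roughly 1 bit per weight for the indices plus a small codebook; the actual method layers further refinements on top of this basic lookup structure.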
Related projects
Alternatives and complementary repositories for VPTQ
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆691 · Updated this week
- Official implementation of Half-Quadratic Quantization (HQQ) ☆704 · Updated this week
- [NeurIPS'24 Spotlight] Speeds up long-context LLM inference with approximate and dynamic sparse attention computation, which reduces in… ☆796 · Updated this week
- A throughput-oriented high-performance serving framework for LLMs ☆640 · Updated 2 months ago
- A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations ☆740 · Updated last week
- Scalable and robust tree-based speculative decoding algorithm ☆318 · Updated 3 months ago
- An innovative library for efficient LLM inference via low-bit quantization ☆348 · Updated 2 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated last month
- Advanced Quantization Algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for t… ☆248 · Updated this week
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆226 · Updated last month
- DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆383 · Updated 3 weeks ago
- For releasing code related to compression methods for transformers, accompanying our publications ☆373 · Updated last month
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens (see the sketch after this list) ☆627 · Updated 2 months ago
- Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24) ☆827 · Updated last week
- Production-ready LLM model compression/quantization toolkit with accelerated inference support for both CPU/GPU via HF, vLLM, and SGLang. ☆126 · Updated this week
- An Open Source Toolkit For LLM Distillation ☆359 · Updated 2 months ago
- A family of compressed models obtained via pruning and knowledge distillation ☆285 · Updated last week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆253 · Updated last month
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models". ☆262 · Updated last year
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆305 · Updated 3 months ago
- [ICML 2024] CLLMs: Consistency Large Language Models ☆355 · Updated last week
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving ☆451 · Updated 2 weeks ago
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆420 · Updated this week
- Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models ☆196 · Updated 7 months ago
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" ☆350 · Updated 8 months ago
- The homepage of the OneBit model quantization framework. ☆157 · Updated 4 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆187 · Updated this week
- Official implementation of "Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling" ☆804 · Updated 3 months ago
- (ICML 2024) BiLLM: Pushing the Limit of Post-Training Quantization for LLMs ☆197 · Updated 5 months ago
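Several of the entries above (the FP16xINT4 kernel, BitBLAS, and the lookup-table matmul library) target the same basic pattern: weights stored as packed low-bit integers plus per-group scales, dequantized on the fly inside a fused matmul. The sketch below is a plain, non-fused reference version of that pattern, assuming symmetric per-group INT4 quantization; the group size, function names, and FP32 demo inputs are illustrative assumptions, not any listed project's API.

```python
# Reference (non-fused) sketch of weight-only INT4 linear layers.
# Assumptions: symmetric per-group quantization, group size 128, FP32 demo inputs.
import torch

GROUP_SIZE = 128  # a common per-group scaling granularity in weight-only INT4 schemes

def quantize_int4(weight: torch.Tensor):
    """Symmetric per-group INT4 quantization of an (out_features, in_features) matrix."""
    out_f, in_f = weight.shape
    w = weight.reshape(out_f, in_f // GROUP_SIZE, GROUP_SIZE)
    scales = w.abs().amax(dim=-1, keepdim=True).clamp_min(1e-6) / 7.0  # map max |w| to 7
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)     # int4 values in int8 storage
    return q.reshape(out_f, in_f), scales.squeeze(-1)

def int4_linear(x: torch.Tensor, q: torch.Tensor, scales: torch.Tensor):
    """Dequantize on the fly, then multiply; a fused kernel avoids materializing w."""
    out_f, in_f = q.shape
    w = q.reshape(out_f, in_f // GROUP_SIZE, GROUP_SIZE).to(x.dtype)
    w = (w * scales.unsqueeze(-1).to(x.dtype)).reshape(out_f, in_f)
    return x @ w.t()

w = torch.randn(4096, 4096)
q, s = quantize_int4(w)
x = torch.randn(8, 4096)   # FP16/BF16 on GPU in practice; FP32 here for portability
y = int4_linear(x, q, s)
print(y.shape)             # torch.Size([8, 4096])
```

A key goal of the specialized kernels listed above is to fuse the dequantization into the matmul so the full-precision weight matrix is never materialized, unlike this reference version.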