microsoft / VPTQLinks

VPTQ, A Flexible and Extreme low-bit quantization algorithm

☆659

Alternatives and similar repositories for VPTQ

Users that are interested in VPTQ are comparing it to the libraries listed below

Sorting:

intel / auto-round
Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU.
☆668Updated this week
mobiusml / hqq
Official implementation of Half-Quadratic Quantization (HQQ)
☆883Updated last month
Cornell-RelaxML / quip-sharp
☆561Updated 11 months ago
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆766Updated 7 months ago
OpenGVLab / EfficientQAT
[ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
☆306Updated 5 months ago
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆390Updated 4 months ago
microsoft / TransformerCompression
For releasing code related to compression methods for transformers, accompanying our publications
☆446Updated 9 months ago
LeanModels / DFloat11
DFloat11: Lossless LLM Compression for Efficient GPU Inference
☆550Updated 2 months ago
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆916Updated last year
mit-han-lab / duo-attention
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
☆494Updated 8 months ago
NVlabs / Minitron
A family of compressed models obtained via pruning and knowledge distillation
☆352Updated 11 months ago
microsoft / MInference
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention…
☆1,141Updated 3 weeks ago
xuyuzhuang11 / OneBit
The homepage of OneBit model quantization framework.
☆193Updated 8 months ago
Infini-AI-Lab / Sequoia
scalable and robust tree-based speculative decoding algorithm
☆360Updated 8 months ago
Cornell-RelaxML / qtip
☆152Updated 4 months ago
efeslab / Nanoflow
A throughput-oriented high-performance serving framework for LLMs
☆904Updated last month
microsoft / BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
☆698Updated 2 months ago
OpenGVLab / OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
☆859Updated 5 months ago
HanGuo97 / flute
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
☆374Updated 6 months ago
NVIDIA / kvpress
LLM KV cache compression made easy
☆660Updated last week
intel / neural-speed
An innovative library for efficient LLM inference via low-bit quantization
☆349Updated last year
Aaronhuang-778 / BiLLM
[ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
☆228Updated 9 months ago
spcl / QuaRot
Code for Neurips24 paper: QuaRot, an end-to-end 4-bit inference of large language models.
☆433Updated 10 months ago
ModelCloud / GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi…
☆842Updated this week
neuralmagic / nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆266Updated last year
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆389Updated last year
vllm-project / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆180Updated this week
SqueezeAILab / SqueezeLLM
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
☆704Updated last year
VITA-Group / Q-GaLore
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.
☆202Updated last year
apoorvumang / prompt-lookup-decoding
☆572Updated last year