ModelCloud / GPTQModel
LLM quantization (compression) toolkit with hardware acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU, and Intel/AMD/Apple CPU, with inference via Hugging Face Transformers, vLLM, and SGLang.
β1,007Updated this week
Alternatives and similar repositories for GPTQModel
Users interested in GPTQModel are comparing it to the libraries listed below.
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ⭐2,705 · Updated this week
- 🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality degradation across Weight-Only Quantiza… ⭐839 · Updated last week
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ⭐1,005 · Updated last year
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. ⭐2,312 · Updated 8 months ago
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLMs' inference, approximate and dynamic sparse calculation of the attention… ⭐1,180 · Updated 4 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ⭐810 · Updated 11 months ago
- VPTQ, a flexible and extreme low-bit quantization algorithm ⭐674 · Updated 9 months ago
- A throughput-oriented high-performance serving framework for LLMs ⭐945 · Updated 3 months ago
- Official implementation of Half-Quadratic Quantization (HQQ) ⭐912 · Updated last month
- [EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models. ⭐672 · Updated 2 months ago
- The Triton TensorRT-LLM Backend ⭐918 · Updated this week
- Fast, Flexible and Portable Structured Generation ⭐1,527 · Updated last week
- An Open Source Toolkit For LLM Distillation ⭐859 · Updated last month
- [ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs. ⭐888 · Updated 2 months ago
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. ⭐676 · Updated this week
- ⭐577 · Updated last year
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). ⭐2,169 · Updated last week
- LLMPerf is a library for validating and benchmarking LLMs ⭐1,081 · Updated last year
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs ⭐830 · Updated this week
- Low-bit LLM inference on CPU/NPU with lookup table ⭐916 · Updated 8 months ago
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ⭐1,600 · Updated last year
- A unified library of SOTA model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresse… ⭐1,925 · Updated this week
- For releasing code related to compression methods for transformers, accompanying our publications ⭐455 · Updated last year
- A pytorch quantization backend for optimum ⭐1,022 · Updated 2 months ago
- ⭐206 · Updated 9 months ago
- A family of compressed models obtained via pruning and knowledge distillation ⭐364 · Updated 3 months ago
- Code repo for the paper "SpinQuant: LLM quantization with learned rotations" ⭐372 · Updated 11 months ago
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ⭐1,316 · Updated 11 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ⭐327 · Updated 2 months ago
- DFloat11 [NeurIPS '25]: Lossless Compression of LLMs and DiTs for Efficient GPU Inference ⭐600 · Updated 2 months ago