intel / auto-round
Advanced quantization algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA, and HPU. Seamlessly integrated with Torchao, Transformers, and vLLM. Export models effortlessly to the AutoGPTQ, AutoAWQ, GGUF, and AutoRound formats with high accuracy even at extremely low bit precision.
☆596 · Updated this week
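For orientation, a typical auto-round flow loads a Hugging Face model, runs the rounding optimization, and exports the quantized weights to one of the formats named above. The snippet below is a minimal sketch, not a definitive reference: it assumes the project's documented Python API (`AutoRound`, `quantize`, `save_quantized`) and the `bits`/`group_size`/`sym` parameters, whose names and defaults may differ between releases.

```python
# Illustrative sketch only: argument names and export formats follow the
# auto-round README at the time of writing and may vary across versions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"  # any Hugging Face causal LM works as a stand-in
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Weight-only 4-bit quantization with symmetric, group-wise scales.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()

# Export to one of the supported formats (e.g. "auto_round", "auto_gptq", "auto_awq").
autoround.save_quantized("./opt-125m-4bit", format="auto_round")
```

Per the description above, the exported directory is intended to be loadable through Transformers or served with vLLM, depending on the chosen export format.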
Alternatives and similar repositories for auto-round
Users interested in auto-round are comparing it to the libraries listed below.
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆652 · Updated 3 months ago
- Official implementation of Half-Quadratic Quantization (HQQ) ☆864 · Updated this week
- Production-ready LLM compression/quantization toolkit with HW-accelerated inference support for both CPU/GPU via HF, vLLM, and SGLa… ☆735 · Updated last week
- An innovative library for efficient LLM inference via low-bit quantization ☆349 · Updated 11 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆735 · Updated 5 months ago
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens ☆880 · Updated 11 months ago
- Efficient LLM Inference over Long Sequences ☆389 · Updated last month
- A family of compressed models obtained via pruning and knowledge distillation ☆348 · Updated 9 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆293 · Updated 3 months ago
- A throughput-oriented high-performance serving framework for LLMs ☆872 · Updated last week
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, approximate and dynamic sparse calculation of the attention… ☆1,105 · Updated last week
- ☆552 · Updated 9 months ago
- Code releases for transformer compression methods, accompanying our publications ☆440 · Updated 7 months ago
- LLM KV cache compression made easy ☆586 · Updated this week
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆309 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆266 · Updated 10 months ago
- A PyTorch quantization backend for Optimum ☆984 · Updated last month
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆149 · Updated last week
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆1,817 · Updated this week
- ☆195 · Updated 3 months ago
- [ICLR 2024 Spotlight] OmniQuant is a simple and powerful quantization technique for LLMs ☆839 · Updated 3 months ago
- Code for the NeurIPS 2024 paper: QuaRot, end-to-end 4-bit inference of large language models ☆414 · Updated 8 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆199 · Updated last year
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ☆327 · Updated 3 months ago
- Scalable and robust tree-based speculative decoding algorithm ☆355 · Updated 6 months ago
- Official implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 ☆1,480 · Updated last week
- Easy and Efficient Quantization for Transformers ☆201 · Updated last month
- DFloat11: Lossless LLM Compression for Efficient GPU Inference ☆518 · Updated 2 weeks ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆372 · Updated last year
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆210 · Updated last week