intel / auto-round
Advanced quantization algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA, and HPU. Seamlessly integrated with Torchao, Transformers, and vLLM. Export your models effortlessly to the autogptq, autoawq, gguf, and autoround formats with high accuracy even at extremely low bit precision.
☆551 · Updated this week
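As a quick orientation for the description above, here is a minimal usage sketch (not part of the original listing) of 4-bit weight-only quantization with auto-round, assuming the `AutoRound(model, tokenizer, bits, group_size)` interface and the `auto_round` export format from the project's README; exact argument names may differ between releases, and the model name is only an illustrative placeholder.

```python
# Minimal sketch: quantize a small causal LM to 4-bit weights with auto-round,
# then export it. Assumes the AutoRound API from the project's README;
# argument names may vary across auto-round releases.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"  # placeholder model for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tune the quantized weights: W4, group size 128, symmetric quantization.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()

# Export in one of the supported formats (e.g. auto_round, auto_gptq, auto_awq, gguf).
autoround.save_quantized("./opt-125m-w4", format="auto_round")
```

The exported directory can then be loaded through Transformers or served with vLLM, per the integrations noted in the description.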
Alternatives and similar repositories for auto-round
Users interested in auto-round are comparing it to the libraries listed below.
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆649 · Updated 3 months ago
- Official implementation of Half-Quadratic Quantization (HQQ) ☆852 · Updated last week
- Efficient LLM Inference over Long Sequences ☆385 · Updated last month
- An innovative library for efficient LLM inference via low-bit quantization ☆349 · Updated 11 months ago
- Production-ready LLM model compression/quantization toolkit with HW-accelerated inference support for both CPU/GPU via HF, vLLM, and SGLa… ☆702 · Updated 2 weeks ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆730 · Updated 4 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆286 · Updated 2 months ago
- For releasing code related to compression methods for transformers, accompanying our publications ☆436 · Updated 6 months ago
- A throughput-oriented high-performance serving framework for LLMs ☆856 · Updated 3 weeks ago
- ☆549 · Updated 9 months ago
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, approximate and dynamic sparse calculation of the attention… ☆1,080 · Updated this week
- A family of compressed models obtained via pruning and knowledge distillation ☆347 · Updated 8 months ago
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆868 · Updated 10 months ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆1,690 · Updated last week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆265 · Updated 9 months ago
- LLM KV cache compression made easy ☆560 · Updated this week
- Code for the NeurIPS'24 paper QuaRot: end-to-end 4-bit inference of large language models. ☆410 · Updated 8 months ago
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆307 · Updated 2 months ago
- [ICLR 2024 Spotlight] OmniQuant is a simple and powerful quantization technique for LLMs. ☆833 · Updated 2 months ago
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆480 · Updated 5 months ago
- DFloat11: Lossless LLM Compression for Efficient GPU Inference ☆455 · Updated 2 months ago
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ☆323 · Updated 2 months ago
- ☆195 · Updated 2 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆141 · Updated this week
- A PyTorch quantization backend for Optimum ☆977 · Updated 3 weeks ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆198 · Updated last year
- [ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs ☆221 · Updated 6 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆362 · Updated 11 months ago
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ and easy export to ONNX/ONNX Runtime. ☆175 · Updated 4 months ago
- A repository dedicated to evaluating the performance of quantized LLaMA3 using various quantization methods. ☆192 · Updated 6 months ago