mobiusml / hqqLinks

Official implementation of Half-Quadratic Quantization (HQQ)

☆883

Alternatives and similar repositories for hqq

Users that are interested in hqq are comparing it to the libraries listed below

Sorting:

Cornell-RelaxML / quip-sharp
☆559Updated 11 months ago
microsoft / VPTQ
VPTQ, A Flexible and Extreme low-bit quantization algorithm
☆659Updated 5 months ago
huggingface / optimum-quanto
A pytorch quantization backend for optimum
☆995Updated last week
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆916Updated last year
intel / auto-round
Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU.
☆668Updated this week
microsoft / TransformerCompression
For releasing code related to compression methods for transformers, accompanying our publications
☆446Updated 9 months ago
SqueezeAILab / SqueezeLLM
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
☆704Updated last year
OpenGVLab / OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
☆859Updated 5 months ago
intel / neural-speed
An innovative library for efficient LLM inference via low-bit quantization
☆349Updated last year
Cornell-RelaxML / QuIP
Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"
☆385Updated last year
apoorvumang / prompt-lookup-decoding
☆572Updated last year
neuralmagic / nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆266Updated last year
arcee-ai / PruneMe
Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models
☆249Updated last year
microsoft / MInference
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention…
☆1,141Updated 3 weeks ago
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆766Updated 7 months ago
punica-ai / punica
Serving multiple LoRA finetuned LLM as one
☆1,101Updated last year
ModelCloud / GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi…
☆828Updated last week
Vahe1994 / SpQR
☆546Updated 10 months ago
hao-ai-lab / LookaheadDecoding
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
☆1,288Updated 7 months ago
efeslab / Nanoflow
A throughput-oriented high-performance serving framework for LLMs
☆904Updated last month
microsoft / Samba
[ICLR 2025] Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
☆915Updated 5 months ago
huggingface / optimum-benchmark
🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O…
☆318Updated 3 weeks ago
NVlabs / Minitron
A family of compressed models obtained via pruning and knowledge distillation
☆352Updated 11 months ago
lapp0 / lm-inference-engines
Comparison of Language Model Inference Engines
☆231Updated 10 months ago
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆390Updated 3 months ago
Infini-AI-Lab / Sequoia
scalable and robust tree-based speculative decoding algorithm
☆359Updated 8 months ago
vllm-project / llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
☆2,106Updated this week
yuhuixu1993 / qa-lora
Official PyTorch implementation of QA-LoRA
☆141Updated last year
NVIDIA / kvpress
LLM KV cache compression made easy
☆660Updated this week
OpenGVLab / EfficientQAT
[ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
☆306Updated 5 months ago