Vahe1994/SpQR

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/Vahe1994/SpQR)

Vahe1994 / SpQR

☆554

Alternatives and similar repositories for SpQR

Users that are interested in SpQR are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

SqueezeAILab / SqueezeLLM
View on GitHub
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
☆722Aug 13, 2024Updated last year
OpenGVLab / OmniQuant
View on GitHub
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
☆903Nov 26, 2025Updated 7 months ago
IST-DASLab / gptq
View on GitHub
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
☆2,336Mar 27, 2024Updated 2 years ago
hahnyuan / RPTQ4LLM
View on GitHub
Reorder-based post-training quantization for large language model
☆199May 17, 2023Updated 3 years ago
mit-han-lab / llm-awq
View on GitHub
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
☆3,594Jul 17, 2025Updated last year
1-Click AI Models by DigitalOcean Gradient • Ad
Deploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click. Zero configuration with optimized deployments.
efeslab / Atom
View on GitHub
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆344Jul 2, 2024Updated 2 years ago
hahnyuan / PB-LLM
View on GitHub
PB-LLM: Partially Binarized Large Language Models
☆158Nov 20, 2023Updated 2 years ago
locuslab / wanda
View on GitHub
A simple and effective LLM pruning approach.
☆869Aug 9, 2024Updated last year
IST-DASLab / QUIK
View on GitHub
Repository for the QUIK project, enabling the use of 4bit kernels for generative inference - EMNLP 2024
☆185Apr 16, 2024Updated 2 years ago
Cornell-RelaxML / QuIP
View on GitHub
Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"
☆400Feb 24, 2024Updated 2 years ago
dropbox / hqq
View on GitHub
Official implementation of Half-Quadratic Quantization (HQQ)
☆949Feb 26, 2026Updated 4 months ago
spcl / QuaRot
View on GitHub
Code for Neurips24 paper: QuaRot, an end-to-end 4-bit inference of large language models.
☆523Nov 26, 2024Updated last year
artidoro / qlora
View on GitHub
QLoRA: Efficient Finetuning of Quantized LLMs
☆10,965Jun 10, 2024Updated 2 years ago
ModelTC / Outlier_Suppression_Plus
View on GitHub
Official implementation of the EMNLP23 paper: Outlier Suppression+: Accurate quantization of large language models by equivalent and opti…
☆52Oct 21, 2023Updated 2 years ago
End-to-end encrypted email - Proton Mail • Ad
Special offer: 40% Off Yearly / 80% Off First Month. All Proton services are open source and independently audited for security.
IST-DASLab / sparsegpt
View on GitHub
Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".
☆890Aug 20, 2024Updated last year
AutoGPTQ / AutoGPTQ
View on GitHub
An easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm.
☆5,075Apr 11, 2025Updated last year
mit-han-lab / omniserve
View on GitHub
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆850Mar 6, 2025Updated last year
bitsandbytes-foundation / bitsandbytes
View on GitHub
Accessible large language models via k-bit quantization for PyTorch.
☆8,337Updated this week
mit-han-lab / smoothquant
View on GitHub
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆1,671Jul 12, 2024Updated 2 years ago
Aaronhuang-778 / BiLLM
View on GitHub
[ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
☆235Jan 11, 2025Updated last year
Vahe1994 / AQLM
View on GitHub
Official Pytorch repository for Extreme Compression of Large Language Models via Additive Quantization https://arxiv.org/pdf/2401.06118.p…
☆1,324Feb 26, 2026Updated 4 months ago
qwopqwop200 / GPTQ-for-LLaMa
View on GitHub
4 bits quantization of LLaMA using GPTQ
☆3,072Jul 13, 2024Updated 2 years ago
turboderp / exllama
View on GitHub
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
☆2,934Sep 30, 2023Updated 2 years ago
Wordpress hosting with auto-scaling - Free Trial Offer • Ad
Fully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
yuhuixu1993 / qa-lora
View on GitHub
Official PyTorch implementation of QA-LoRA
☆147Mar 13, 2024Updated 2 years ago
microsoft / TransformerCompression
View on GitHub
For releasing code related to compression methods for transformers, accompanying our publications
☆461Jan 16, 2025Updated last year
IST-DASLab / marlin
View on GitHub
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆1,111Sep 4, 2024Updated last year
IST-DASLab / qmoe
View on GitHub
Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
☆278Nov 3, 2023Updated 2 years ago
SqueezeAILab / KVQuant
View on GitHub
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆431Aug 13, 2024Updated last year
Cornell-RelaxML / quip-sharp
View on GitHub
☆602Oct 29, 2024Updated last year
IST-DASLab / OBC
View on GitHub
Code for the NeurIPS 2022 paper "Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning".
☆132Jul 11, 2023Updated 3 years ago
jy-yuan / KIVI
View on GitHub
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
☆420Nov 20, 2025Updated 8 months ago
ChenMnZ / PrefixQuant
View on GitHub
An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization
☆176Nov 26, 2025Updated 7 months ago
Managed Kubernetes at scale on DigitalOcean • Ad
DigitalOcean Kubernetes includes the control plane, bandwidth allowance, container registry, automatic updates, and more for free.
IST-DASLab / Quartet
View on GitHub
☆127Mar 18, 2026Updated 4 months ago
CStanKonrad / long_llama
View on GitHub
LongLLaMA is a large language model capable of handling long contexts. It is based on OpenLLaMA and fine-tuned with the Focused Transform…
☆1,465Nov 7, 2023Updated 2 years ago
rmihaylov / mpttune
View on GitHub
Tune MPTs
☆84Jun 17, 2023Updated 3 years ago
IST-DASLab / QIGen
View on GitHub
Repository for CPU Kernel Generation for LLM Inference
☆28Jul 13, 2023Updated 3 years ago
FasterDecoding / Medusa
View on GitHub
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
☆2,758Jun 25, 2024Updated 2 years ago
OpenLMLab / LOMO
View on GitHub
LOMO: LOw-Memory Optimization
☆994Jul 2, 2024Updated 2 years ago
DachengLi1 / LongChat
View on GitHub
Official repository for LongChat and LongEval
☆536May 24, 2024Updated 2 years ago