AlpinDale / QuIP-for-LlamaLinks

Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" adapted for Llama models

☆40

Alternatives and similar repositories for QuIP-for-Llama

Users that are interested in QuIP-for-Llama are comparing it to the libraries listed below

Sorting:

IST-DASLab / qmoe
Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
☆277Updated 2 years ago
GreenBitAI / low_bit_llama
Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs
☆110Updated last year
hahnyuan / PB-LLM
PB-LLM: Partially Binarized Large Language Models
☆156Updated 2 years ago
Cornell-RelaxML / QuIP
Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"
☆387Updated last year
chu-tianxiang / QuIP-for-all
QuIP quantization
☆60Updated last year
Cornell-RelaxML / qtip
☆154Updated 4 months ago
wejoncy / QLLM
A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily.
☆180Updated 7 months ago
hahnyuan / RPTQ4LLM
Reorder-based post-training quantization for large language model
☆194Updated 2 years ago
mlc-ai / llm-perf-bench
☆120Updated last year
astramind-ai / BitMat
An efficent implementation of the method proposed in "The Era of 1-bit LLMs"
☆154Updated last year
OpenGVLab / EfficientQAT
[ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
☆310Updated 5 months ago
BlinkDL / modded-nanogpt-rwkv
RWKV-7: Surpassing GPT
☆100Updated last year
Infini-AI-Lab / Sequoia
scalable and robust tree-based speculative decoding algorithm
☆361Updated 9 months ago
IST-DASLab / QUIK
Repository for the QUIK project, enabling the use of 4bit kernels for generative inference - EMNLP 2024
☆183Updated last year
fpgaminer / GPTQ-triton
GPTQ inference Triton kernel
☆313Updated 2 years ago
snu-mllab / GuidedQuant
Official PyTorch implementation of "GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance" (ICML 2025)
☆47Updated 4 months ago
VITA-Group / Q-GaLore
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.
☆202Updated last year
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆390Updated last year
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆85Updated last year
qwopqwop200 / gptqlora
GPTQLoRA: Efficient Finetuning of Quantized LLMs with GPTQ
☆101Updated 2 years ago
wdlctc / mini-s
☆52Updated last year
arcee-ai / PruneMe
Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models
☆254Updated last year
Aaronhuang-778 / BiLLM
[ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
☆227Updated 10 months ago
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆240Updated last year
Cornell-RelaxML / quip-sharp
☆565Updated last year
aredden / torch-bnb-fp4
Faster Pytorch bitsandbytes 4bit fp4 nn.Linear ops
☆29Updated last year
HazyResearch / lolcats
Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models"
☆249Updated 9 months ago
SqueezeAILab / SqueezeLLM
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
☆708Updated last year
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆143Updated 9 months ago
neuralmagic / nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆267Updated last year