casper-hansen / AutoAWQLinks

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:

☆2,269

Alternatives and similar repositories for AutoAWQ

Users that are interested in AutoAWQ are comparing it to the libraries listed below

Sorting:

mit-han-lab / llm-awq
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
☆3,340Updated 4 months ago
IST-DASLab / gptq
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
☆2,219Updated last year
AutoGPTQ / AutoGPTQ
An easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm.
☆4,988Updated 7 months ago
S-LoRA / S-LoRA
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
☆1,868Updated last year
deepspeedai / DeepSpeed-MII
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
☆2,074Updated 4 months ago
vllm-project / llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
☆2,238Updated this week
jquesnelle / yarn
YaRN: Efficient Context Window Extension of Large Language Models
☆1,637Updated last year
FasterDecoding / Medusa
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
☆2,660Updated last year
hao-ai-lab / LookaheadDecoding
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
☆1,302Updated 8 months ago
ModelCloud / GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi…
☆876Updated last week
ModelTC / LightLLM
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalabili…
☆3,717Updated last week
mit-han-lab / smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆1,550Updated last year
flashinfer-ai / flashinfer
FlashInfer: Kernel Library for LLM Serving
☆4,062Updated this week
huggingface / nanotron
Minimalistic large language model 3D-parallelism training
☆2,323Updated 2 months ago
triton-inference-server / tensorrtllm_backend
The Triton TensorRT-LLM Backend
☆906Updated this week
punica-ai / punica
Serving multiple LoRA finetuned LLM as one
☆1,116Updated last year
dropbox / hqq
Official implementation of Half-Quadratic Quantization (HQQ)
☆891Updated 3 weeks ago
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆946Updated last year
jiaweizzhao / GaLore
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
☆1,621Updated last year
horseee / LLM-Pruner
[NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baich…
☆1,080Updated last year
InternLM / lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
☆7,265Updated this week
microsoft / MInference
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention…
☆1,148Updated last month
mlc-ai / xgrammar
Fast, Flexible and Portable Structured Generation
☆1,375Updated last week
RahulSChand / gpu_poor
Calculate token/s & GPU memory requirement for any LLM. Supports llama.cpp/ggml/bnb/QLoRA quantization
☆1,380Updated 11 months ago
OpenGVLab / OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
☆872Updated 5 months ago
huggingface / optimum-quanto
A pytorch quantization backend for optimum
☆1,009Updated 3 weeks ago
intel / intel-extension-for-transformers
⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl…
☆2,164Updated last year
NVIDIA / TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on H…
☆2,909Updated this week
deepspeedai / Megatron-DeepSpeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
☆2,188Updated 3 months ago
bitsandbytes-foundation / bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
☆7,755Updated this week