Cornell-RelaxML / qtipLinks

☆152

Alternatives and similar repositories for qtip

Users that are interested in qtip are comparing it to the libraries listed below

Sorting:

IST-DASLab / Quartet
☆102Updated this week
FasterDecoding / TEAL
☆145Updated 8 months ago
Cornell-RelaxML / yaqa-quantization
☆60Updated 4 months ago
chu-tianxiang / QuIP-for-all
QuIP quantization
☆59Updated last year
IST-DASLab / QuEST
Work in progress.
☆74Updated 3 months ago
OpenGVLab / EfficientQAT
[ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
☆308Updated 5 months ago
Anonymous1252022 / fp4-all-the-way
☆35Updated 5 months ago
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆142Updated 8 months ago
Aaronhuang-778 / BiLLM
[ICML 2024] BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
☆228Updated 9 months ago
ChenMnZ / PrefixQuant
An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization
☆160Updated 5 months ago
vllm-project / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆180Updated this week
amazon-science / mxfp4-llm
Official implementation for Training LLMs with MXFP4
☆100Updated 6 months ago
GATECH-EIC / ShiftAddLLM
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
☆110Updated last year
hahnyuan / PB-LLM
PB-LLM: Partially Binarized Large Language Models
☆156Updated last year
snu-mllab / GuidedQuant
Official PyTorch implementation of "GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance" (ICML 2025)
☆45Updated 3 months ago
IST-DASLab / MoE-Quant
Code for data-aware compression of DeepSeek models
☆56Updated 4 months ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆84Updated last year
spcl / QuaRot
Code for Neurips24 paper: QuaRot, an end-to-end 4-bit inference of large language models.
☆433Updated 11 months ago
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆169Updated last year
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆389Updated last year
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆238Updated 11 months ago
jy-yuan / KIVI
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
☆329Updated last month
Dao-AILab / fast-hadamard-transform
Fast Hadamard transform in CUDA, with a PyTorch interface
☆253Updated last week
facebookresearch / ParetoQ
This repository contains the training code of ParetoQ introduced in our work "ParetoQ Scaling Laws in Extremely Low-bit LLM Quantization"
☆108Updated last week
Dao-AILab / grouped-latent-attention
☆130Updated 4 months ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆130Updated 10 months ago
FasterDecoding / BitDelta
☆202Updated 10 months ago
HazyResearch / lolcats
Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models"
☆248Updated 8 months ago
astramind-ai / BitMat
An efficent implementation of the method proposed in "The Era of 1-bit LLMs"
☆154Updated last year
thu-nics / MoA
[CoLM'25] The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression>
☆147Updated 3 months ago