wejoncy / QLLM
A general 2-8 bit quantization toolbox supporting GPTQ/AWQ/HQQ, with easy export to ONNX/ONNX Runtime.
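For intuition, here is a minimal sketch of the group-wise k-bit weight quantization that toolboxes like this implement. It is illustrative only: the function names are hypothetical, it uses plain round-to-nearest rather than the GPTQ/AWQ/HQQ solvers (which additionally compensate rounding error), and it assumes PyTorch.

```python
# Illustrative sketch of group-wise symmetric k-bit weight quantization.
# NOT QLLM's API; names are hypothetical and the scheme is simplified.
import torch

def quantize_groupwise(w: torch.Tensor, bits: int = 4, group_size: int = 128):
    """Round-to-nearest symmetric quantization, one scale per group.
    Assumes in_features is divisible by group_size."""
    out_features, in_features = w.shape
    w = w.reshape(out_features, in_features // group_size, group_size)
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for signed 4-bit
    scale = w.abs().amax(dim=-1, keepdim=True) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                  # int codes + fp scales

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() * scale).reshape(q.shape[0], -1)

w = torch.randn(256, 512)
q, s = quantize_groupwise(w, bits=4)
print((w - dequantize(q, s)).abs().max())           # worst-case rounding error
```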
☆148 · Updated last month
Related projects
Alternatives and complementary repositories for QLLM
- ☆156 · Updated last month
- An easy-to-use package for implementing SmoothQuant for LLMs ☆82 · Updated 5 months ago
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆166 · Updated 10 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆302 · Updated 2 months ago
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆222 · Updated last month
- ☆114 · Updated 6 months ago
- Production-ready LLM model compression/quantization toolkit with accelerated inference support for both CPU/GPU via HF, vLLM, and SGLang ☆118 · Updated this week
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆277 · Updated 4 months ago
- Advanced Quantization Algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for t…" ☆245 · Updated this week
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5) ☆196 · Updated last week
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving ☆433 · Updated 2 months ago
- GPTQ inference Triton kernel ☆283 · Updated last year
- ☆121 · Updated this week
- Reorder-based post-training quantization for large language models ☆181 · Updated last year
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆250 · Updated 3 weeks ago
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs ☆75 · Updated 3 weeks ago
- KIVI: A Tuning-Free Asymmetric 2-bit Quantization for KV Cache (see the asymmetric-quantization sketch after this list) ☆240 · Updated 3 weeks ago
- Easy and Efficient Quantization for Transformers ☆178 · Updated 3 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆183 · Updated last month
- ☆183 · Updated 6 months ago
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ☆253 · Updated 2 months ago
- Code for the NeurIPS 2024 paper: QuaRot, end-to-end 4-bit inference of large language models ☆278 · Updated 3 months ago
- KV cache compression for high-throughput LLM inference ☆82 · Updated this week
- FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens ☆611 · Updated 2 months ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆112 · Updated 8 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆146 · Updated 3 months ago
- A quantization algorithm for LLMs ☆101 · Updated 4 months ago
- PB-LLM: Partially Binarized Large Language Models ☆146 · Updated 11 months ago
- ☆133 · Updated last year
- Official PyTorch implementation of FlatQuant: Flatness Matters for LLM Quantization ☆57 · Updated this week
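As referenced from the KIVI entry above, here is a minimal sketch of asymmetric k-bit quantization as applied to a KV cache. Illustrative only, assuming PyTorch: the names are hypothetical and this is not KIVI's implementation, which additionally quantizes keys per-channel and values per-token while keeping a small full-precision residual window.

```python
# Sketch of asymmetric k-bit quantization for cached keys/values.
# NOT KIVI's code; names and the layout assumption are hypothetical.
import torch

def quantize_asym(x: torch.Tensor, bits: int = 2, dim: int = -1):
    """Asymmetric quantization along `dim`: q = round((x - min) / scale)."""
    qmax = 2 ** bits - 1                         # 3 for unsigned 2-bit codes
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    q = torch.round((x - xmin) / scale).clamp(0, qmax).to(torch.uint8)
    return q, scale, xmin                        # codes + per-group metadata

def dequantize_asym(q, scale, xmin):
    return q.float() * scale + xmin

k = torch.randn(32, 128)                         # cached keys: (tokens, channels)
q, s, z = quantize_asym(k, bits=2, dim=0)        # per-channel stats for keys
print((k - dequantize_asym(q, s, z)).abs().mean())
```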