ModelCloud/GPTQModel
Production-ready LLM model compression/quantization toolkit with accelerated inference support for both CPU and GPU via HF, vLLM, and SGLang.
☆125 · Updated this week
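For context, here is a minimal sketch of the typical quantize-then-reload flow, assuming GPTQModel's Python API (`GPTQModel.load`, `QuantizeConfig`, `quantize`, `save`); the model id and calibration texts below are placeholders, and exact signatures may vary between releases:

```python
# Minimal sketch: 4-bit GPTQ quantization of a Hugging Face model with GPTQModel.
# Assumptions: the gptqmodel package is installed; the model id is a placeholder;
# a real run needs a few hundred representative calibration texts, not two.
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder HF model id
quant_path = "./llama-3.2-1b-gptq-4bit"

calibration_dataset = [
    "GPTQ quantizes weight columns one at a time, compensating rounding error.",
    "KV cache quantization trades a little accuracy for much longer contexts.",
]

quant_config = QuantizeConfig(bits=4, group_size=128)  # 4-bit, groupwise scales

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_dataset)  # one-shot post-training quantization
model.save(quant_path)

# Reload the quantized checkpoint for accelerated inference.
model = GPTQModel.load(quant_path)
tokens = model.generate("The capital of France is")[0]
print(model.tokenizer.decode(tokens))  # assumes the loader attaches a tokenizer
```

A checkpoint saved this way can then be served through HF, vLLM, or SGLang, per the description above.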
Related projects
Alternatives and complementary repositories for GPTQModel
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆224 · Updated last month
- ☆157 · Updated last month
- A general 2–8 bit quantization toolbox with GPTQ/AWQ/HQQ, and easy export to ONNX/ONNX Runtime. ☆149 · Updated last month
- Advanced Quantization Algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for t… ☆248 · Updated this week
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆241 · Updated last month
- Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models ☆196 · Updated 6 months ago
- KV cache compression for high-throughput LLM inference ☆87 · Updated this week
- A repository dedicated to evaluating the performance of quantized LLaMA3 using various quantization methods. ☆166 · Updated 3 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆173 · Updated 4 months ago
- An Open Source Toolkit For LLM Distillation ☆356 · Updated 2 months ago
- ☆188 · Updated 6 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆305 · Updated 3 months ago
- DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆373 · Updated 3 weeks ago
- A family of compressed models obtained via pruning and knowledge distillation ☆283 · Updated last week
- A toolkit for fine-tuning, inference, and evaluation of GreenBitAI's LLMs. ☆74 · Updated last month
- ☆67 · Updated last week
- Code for the NeurIPS 2024 paper QuaRot: end-to-end 4-bit inference of large language models. ☆284 · Updated 3 months ago
- Easy and Efficient Quantization for Transformers ☆180 · Updated 4 months ago
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆78 · Updated this week
- For releasing code related to compression methods for transformers, accompanying our publications ☆372 · Updated last month
- QuIP quantization ☆46 · Updated 8 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated last month
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving ☆443 · Updated last week
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆685 · Updated this week
- Micro Llama is a small Llama-based model with 300M parameters, trained from scratch on a $500 budget ☆126 · Updated 7 months ago
- A pipeline for LLM knowledge distillation ☆78 · Updated 3 months ago
- EvolKit is an innovative framework designed to automatically enhance the complexity of instructions used for fine-tuning Large Language M… ☆180 · Updated 3 weeks ago
- ☆199 · Updated 5 months ago
- Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens" ☆124 · Updated 4 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆253 · Updated last month