casper-hansen / AutoAWQ_kernels
☆48 · Updated last week
Related projects:
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, with easy export to ONNX/ONNX Runtime. ☆141 · Updated 3 weeks ago
- ☆50 · Updated 3 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆47 · Updated 2 weeks ago
- ☆145 · Updated last month
- ☆110 · Updated 4 months ago
- Low-bit optimizers for PyTorch ☆109 · Updated 11 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆134 · Updated 2 months ago
- Summary of system papers/frameworks/code/tools for training or serving large models ☆56 · Updated 9 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆50 · Updated 3 months ago
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆156 · Updated 9 months ago
- An easy-to-use LLM quantization and inference toolkit based on the GPTQ algorithm (weight-only quantization) ☆90 · Updated this week
- GPU operators for sparse tensor operations ☆27 · Updated 6 months ago
- ☆75 · Updated this week
- Advanced Quantization Algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for t…" ☆205 · Updated this week
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆29 · Updated 6 months ago
- An easy-to-use package for implementing SmoothQuant for LLMs ☆78 · Updated 4 months ago
- ☆130 · Updated last year
- Fast Inference of MoE Models with CPU-GPU Orchestration ☆163 · Updated 3 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). ☆173 · Updated 3 months ago
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆185 · Updated 3 weeks ago
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆213 · Updated 3 weeks ago
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind ☆69 · Updated 6 months ago
- QAQ: Quality Adaptive Quantization for LLM KV Cache ☆42 · Updated 5 months ago
- ☆60 · Updated last month
- A toolkit that enhances PyTorch with specialized functions for low-bit quantized neural networks. ☆24 · Updated 2 months ago
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models ☆121 · Updated 3 months ago
- ☆67 · Updated last week
- Implementation of Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting ☆39 · Updated 2 months ago
- ☆83 · Updated 3 weeks ago
- Breaking the Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆55 · Updated this week