NetEase-FuXi / EETQ
Easy and Efficient Quantization for Transformers
☆172 · Updated 2 months ago
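For context, here is a minimal sketch of how EETQ is commonly used through the Hugging Face transformers integration: loading a causal LM with int8 weight-only quantization applied on the fly. This assumes the `eetq` package and a CUDA GPU are available; the model name is only an example.

```python
# Minimal sketch, assuming the `eetq` package and a CUDA GPU are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer, EetqConfig

model_id = "facebook/opt-125m"  # example model; any causal LM should work

# int8 weight-only quantization applied while loading the checkpoint
quant_config = EetqConfig("int8")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Quantization makes inference", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```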
Related projects:
- FP16×INT4 LLM inference kernel that achieves near-ideal ~4× speedups up to medium batch sizes of 16–32 tokens. ☆562 · Updated 2 weeks ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆250 · Updated this week
- A general 2–8 bit quantization toolbox with GPTQ/AWQ/HQQ, with easy export to ONNX/ONNX Runtime ☆141 · Updated 3 weeks ago
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆282 · Updated last month
- GPTQ inference Triton kernel ☆273 · Updated last year
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆106 · Updated 6 months ago
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving ☆399 · Updated 2 weeks ago
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆156 · Updated 9 months ago
- Simple implementation of Speculative Sampling in NumPy for GPT-2 (see the sketch after this list). ☆87 · Updated last year
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆416 · Updated this week
- Explorations into some recent techniques surrounding speculative decoding ☆190 · Updated 11 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5) ☆173 · Updated 3 months ago
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆231 · Updated this week
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆258 · Updated 2 months ago
- Applied AI experiments and examples for PyTorch ☆123 · Updated last month
- Comparison of Language Model Inference Engines ☆178 · Updated 2 weeks ago
- Advanced Quantization Algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for t… ☆205 · Updated this week
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" ☆339 · Updated 6 months ago
- Official PyTorch implementation of QA-LoRA ☆111 · Updated 6 months ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆150 · Updated last week
- A throughput-oriented high-performance serving framework for LLMs ☆470 · Updated this week
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ☆240 · Updated 2 weeks ago
- Code releases for compression methods for transformers, accompanying the authors' publications ☆356 · Updated 2 weeks ago
- OpenAI-compatible API for the TensorRT-LLM Triton backend ☆148 · Updated last month
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆156 · Updated this week
- BitBLAS is a library supporting mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆342 · Updated this week
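Since several entries above revolve around speculative sampling/decoding, here is a minimal, self-contained NumPy sketch of the core accept/reject step. It uses toy stand-in distributions instead of real GPT-2 draft and target models, and every function name here is illustrative rather than taken from any of the listed repositories.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size

def draft_probs(prefix):
    """Stand-in for a small draft model: any distribution over the vocabulary."""
    logits = rng.normal(size=VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def target_probs(prefix):
    """Stand-in for the large target model."""
    logits = rng.normal(size=VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def speculative_step(prefix, k=4):
    """Draft k tokens cheaply, then accept/reject them against the target model."""
    drafted, q = [], []
    for _ in range(k):
        p_draft = draft_probs(prefix + drafted)
        tok = int(rng.choice(VOCAB, p=p_draft))
        drafted.append(tok)
        q.append(p_draft)

    accepted = []
    for i, tok in enumerate(drafted):
        p_tgt = target_probs(prefix + accepted)
        # Accept the drafted token with probability min(1, p_target / p_draft).
        if rng.random() < min(1.0, p_tgt[tok] / q[i][tok]):
            accepted.append(tok)
        else:
            # On rejection, resample from the residual distribution max(p_target - p_draft, 0).
            residual = np.maximum(p_tgt - q[i], 0.0)
            residual = residual / residual.sum()
            accepted.append(int(rng.choice(VOCAB, p=residual)))
            return accepted  # stop after the first rejection
    # All k drafts accepted: sample one bonus token from the target distribution.
    p_tgt = target_probs(prefix + accepted)
    accepted.append(int(rng.choice(VOCAB, p=p_tgt)))
    return accepted

print(speculative_step([1, 2, 3]))
```

The accept/reject rule above is what makes the output distribution match the target model exactly while letting the cheap draft model propose several tokens per target-model call.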