bytedance / ABQ-LLM
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
☆216Updated 5 months ago
Alternatives and similar repositories for ABQ-LLM:
Users who are interested in ABQ-LLM are comparing it to the libraries listed below.
- Unified KV Cache Compression Methods for Auto-Regressive Models☆932Updated 2 months ago
- MIXQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction☆80Updated 4 months ago
- Support mixed-precision inference with vLLM☆80Updated 2 months ago
- A highly optimized LLM inference acceleration engine for Llama and its variants.☆872Updated this week
- ☆107Updated 4 years ago
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference☆151Updated 4 months ago
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding☆240Updated 6 months ago
- SQuant [ICLR22]☆131Updated 2 years ago
- Adds sequence parallelism to LLaMA-Factory☆399Updated this week
- Mixed-precision inference with TensorRT-LLM☆76Updated 4 months ago
- Higher performance OpenAI LLM service than vLLM serve: A pure C++ high-performance OpenAI LLM service implemented with GPRS+TensorRT-LLM+…☆114Updated this week
- Official Code for "SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression"☆135Updated last month
- TVM Documentation in Chinese Simplified / TVM 中文文档☆809Updated this week
- The framework to prune LLMs to any size and any config.☆88Updated last year
- APOLLO: SGD-like Memory, AdamW-level Performance☆178Updated this week
- A toolkit that enhances PyTorch with specialized functions for low-bit quantized neural networks.☆197Updated 8 months ago
- Deep Learning Deployment Framework: Supports tf/torch/trt/trtllm/vllm and other NN frameworks. Support dynamic batching, and streaming mo…☆156Updated last week
- JittorGeometric is a Jittor-based graph machine learning library.☆150Updated last month
- FlagPerf is an open-source software platform for benchmarking AI chips.☆324Updated last month
- Official PyTorch implementation of FlatQuant: Flatness Matters for LLM Quantization☆108Updated last month
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.☆107Updated 3 weeks ago
- Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models☆171Updated 4 months ago
- An easy-to-use package for implementing SmoothQuant for LLMs☆94Updated 9 months ago
- A deployment, monitoring, and autoscaling service for serverless LLM serving.☆149Updated last week
- ☆127Updated 2 months ago
- ☆139Updated 10 months ago
- A collection of memory efficient attention operators implemented in the Triton language.☆250Updated 9 months ago
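- Train an LLM from scratch with DeepSpeed, through the pretraining and SFT stages, to verify the LLM's ability to learn knowledge, understand language, and answer questions☆144Updated 8 months ago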
- ☆72Updated 3 months ago