zhihu / TLLM_QMM
TLLM_QMM extracts the quantized-kernel implementations from NVIDIA's TensorRT-LLM, removes the NVInfer dependency, and exposes them as easy-to-use PyTorch modules. We modified the dequantization and weight preprocessing to align with popular quantization algorithms such as AWQ and GPTQ, and combined them with the new FP8 quantization.
☆16 · Updated 8 months ago
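To make the description concrete, the sketch below shows the group-wise int4 dequantization math that AWQ/GPTQ-style weight-only kernels perform on the fly: each group of weights shares one scale and one zero-point, and the 4-bit codes are mapped back to floats as `(q - zero) * scale` before the matmul. This is an illustrative NumPy sketch of the general scheme, not TLLM_QMM's actual API; the function name and layout are assumptions for illustration only.

```python
import numpy as np

def dequantize_group_int4(q, scales, zeros, group_size):
    """Illustrative group-wise dequantization (not TLLM_QMM's real API).

    q:      (out, in) integer codes in [0, 15]
    scales: (out, in // group_size) per-group scales
    zeros:  (out, in // group_size) per-group zero-points
    """
    out_f, in_f = q.shape
    g = q.reshape(out_f, in_f // group_size, group_size)
    # Map each 4-bit code back to a float: (q - zero) * scale per group.
    w = (g - zeros[..., None]) * scales[..., None]
    return w.reshape(out_f, in_f).astype(np.float32)

# Tiny round trip: quantize a random weight, dequantize, check the error bound.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
group_size = 4
gw = w.reshape(4, 8 // group_size, group_size)
wmin, wmax = gw.min(-1), gw.max(-1)
scales = (wmax - wmin) / 15.0           # 16 levels -> 15 steps per group
zeros = np.round(-wmin / scales)        # integer zero-point per group
q = np.clip(np.round(gw / scales[..., None]) + zeros[..., None], 0, 15)
w_hat = dequantize_group_int4(q.reshape(4, 8), scales, zeros, group_size)
max_err = np.abs(w - w_hat).max()       # bounded by ~1.5x the largest scale
```

Production kernels fuse this dequantization into the GEMM itself so the int4 weights never materialize as a full-precision tensor in global memory; the sketch only shows the arithmetic being fused.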
Alternatives and similar repositories for TLLM_QMM:
Users interested in TLLM_QMM are comparing it to the libraries listed below
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆93 · Updated last year
- PyTorch distributed training acceleration framework ☆46 · Updated last month
- KV cache store for distributed LLM inference ☆107 · Updated this week
- ☆127 · Updated 3 months ago
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆109 · Updated 3 weeks ago
- ☆36 · Updated 3 months ago
- TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models. ☆92 · Updated last year
- The driver for LMCache core to run in vLLM ☆36 · Updated last month
- ☆145 · Updated 2 months ago
- ☆76 · Updated last week
- HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of… ☆141 · Updated last week
- ☆126 · Updated 3 years ago
- ☆78 · Updated last year
- A library developed by Volcano Engine for high-performance reading and writing of PyTorch model files. ☆16 · Updated 3 months ago
- ☆184 · Updated 6 months ago
- ☆58 · Updated 4 years ago
- ☆139 · Updated 11 months ago
- ☆91 · Updated 4 months ago
- ☆324 · Updated 2 months ago
- Elastic Deep Learning Training based on Kubernetes by Leveraging EDL and Volcano ☆32 · Updated last year
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver… ☆232 · Updated last week
- DeepSeek-V3/R1 inference performance simulator ☆89 · Updated last week
- ☆60 · Updated 3 months ago
- DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including … ☆240 · Updated 3 weeks ago
- Transformer related optimization, including BERT, GPT ☆59 · Updated last year
- ☆128 · Updated 3 weeks ago
- Elastic Serverless Serving based on Kubernetes, provides 0 instance serving capability. ☆11 · Updated 3 years ago
- ☆21 · Updated last year
- [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral… ☆51 · Updated 8 months ago
- ☆45 · Updated this week