zhihu / TLLM_QMM
TLLM_QMM strips the quantized-kernel implementations out of NVIDIA's TensorRT-LLM, removing the NVInfer dependency and exposing them as an easy-to-use PyTorch module. We modified the dequantization and weight preprocessing to align with popular quantization algorithms such as AWQ and GPTQ, and combined them with the new FP8 quantization.
☆16 · Updated last year
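The AWQ/GPTQ-style kernels mentioned above operate on group-wise quantized weights: each small group of weights shares one floating-point scale and one integer zero point. As a rough illustration only (this is not TLLM_QMM's actual API, and the function names are hypothetical), a minimal pure-Python sketch of asymmetric group-wise int4 quantization and dequantization:

```python
def quantize_groupwise(w, group_size=4, bits=4):
    """Asymmetric group-wise quantization (hypothetical sketch):
    each group of `group_size` weights shares one scale and zero point."""
    qmax = (1 << bits) - 1
    q, scales, zeros = [], [], []
    for g in range(0, len(w), group_size):
        group = w[g:g + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / qmax or 1.0      # avoid div-by-zero for flat groups
        zero = round(-lo / scale)            # integer zero point
        q.append([min(qmax, max(0, round(x / scale + zero))) for x in group])
        scales.append(scale)
        zeros.append(zero)
    return q, scales, zeros

def dequantize_groupwise(q, scales, zeros):
    """Recover approximate weights: w ≈ (q - zero) * scale."""
    return [(v - z) * s for group, s, z in zip(q, scales, zeros) for v in group]

# Round-trip demo: error is bounded by half a quantization step per group.
w = [-1.0 + 2.0 * i / 15 for i in range(16)]
q, scales, zeros = quantize_groupwise(w, group_size=4, bits=4)
w_hat = dequantize_groupwise(q, scales, zeros)
assert max(abs(a - b) for a, b in zip(w, w_hat)) <= max(scales) / 2 + 1e-9
```

In real AWQ/GPTQ kernels the 4-bit values are packed several per machine word and the dequantize-then-matmul step is fused on the GPU; the sketch only shows the numeric mapping.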
Alternatives and similar repositories for TLLM_QMM
Users interested in TLLM_QMM are comparing it to the libraries listed below.
- PyTorch distributed training acceleration framework ☆51 · Updated 5 months ago
- TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models. ☆94 · Updated 2 years ago
- ☆477 · Updated this week
- Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training. ☆267 · Updated 2 years ago
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆104 · Updated 2 months ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆110 · Updated last year
- KV cache store for distributed LLM inference ☆305 · Updated 2 months ago
- ☆128 · Updated 7 months ago
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver… ☆256 · Updated this week
- GLake: optimizing GPU memory management and IO transmission. ☆471 · Updated 4 months ago
- A high-performance framework for training wide-and-deep recommender systems on heterogeneous clusters ☆158 · Updated last year
- ☆220 · Updated last year
- HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of… ☆163 · Updated this week
- ☆44 · Updated 7 months ago
- RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications. ☆830 · Updated 2 weeks ago
- ☆58 · Updated 4 years ago
- The driver for LMCache core to run in vLLM ☆45 · Updated 6 months ago
- NCCL Fast Socket is a transport-layer plugin to improve NCCL collective communication performance on Google Cloud. ☆119 · Updated last year
- Pipeline Parallelism Emulation and Visualization ☆56 · Updated last month
- Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv… ☆185 · Updated this week
- DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including … ☆264 · Updated this week
- ☆149 · Updated 7 months ago
- DeepSeek-V3/R1 inference performance simulator ☆164 · Updated 4 months ago
- A library developed by Volcano Engine for high-performance reading and writing of PyTorch model files. ☆20 · Updated 7 months ago
- ☆23 · Updated 7 months ago
- ☆195 · Updated 3 months ago
- A lightweight parameter server interface ☆80 · Updated 2 years ago
- DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments. ☆43 · Updated last week
- ☆127 · Updated 4 years ago
- [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral… ☆61 · Updated last year