zhihu / TLLM_QMM
TLLM_QMM strips the quantized-kernel implementations out of NVIDIA's TensorRT-LLM, removes the NVInfer dependency, and exposes them as an easy-to-use PyTorch module. We modified the dequantization and weight preprocessing to align with popular quantization algorithms such as AWQ and GPTQ, and combined them with the new FP8 quantization.
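To make the quantization scheme concrete, here is a minimal pure-PyTorch sketch of a group-wise W4A16 (4-bit weight, 16-bit activation) GEMM of the kind AWQ and GPTQ target. This is a reference implementation under assumed conventions: the function names, the asymmetric zero-point scheme, and the group size of 128 are illustrative choices, not TLLM_QMM's actual API, which runs fused CUDA kernels rather than dequantize-then-matmul.

```python
# Illustrative sketch of group-wise W4A16 numerics; NOT TLLM_QMM's API.
import torch

def quantize_w4_groupwise(w: torch.Tensor, group_size: int = 128):
    """Asymmetric 4-bit group-wise quantization along the input dim.
    w: [out_features, in_features]; returns uint8 codes in [0, 15]
    plus per-group scales and zero-points."""
    out_f, in_f = w.shape
    assert in_f % group_size == 0
    wg = w.float().reshape(out_f, in_f // group_size, group_size)
    w_min = wg.amin(dim=-1, keepdim=True)
    w_max = wg.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0  # 16 levels for 4 bits
    zero = (-w_min / scale).round().clamp(0, 15)
    q = (wg / scale + zero).round().clamp(0, 15).to(torch.uint8)
    return q.reshape(out_f, in_f), scale.squeeze(-1), zero.squeeze(-1)

def w4a16_matmul(x, q, scale, zero, group_size: int = 128):
    """y = x @ W^T, with W reconstructed from its int4 codes.
    A fused kernel would instead dequantize inline inside the GEMM."""
    out_f, in_f = q.shape
    qg = q.reshape(out_f, in_f // group_size, group_size).float()
    w = (qg - zero.unsqueeze(-1)) * scale.unsqueeze(-1)
    return x @ w.reshape(out_f, in_f).to(x.dtype).T

# fp32 demo tensors for CPU portability; production uses fp16 activations.
x = torch.randn(4, 512)
w = torch.randn(1024, 512)
q, s, z = quantize_w4_groupwise(w)
y = w4a16_matmul(x, q, s, z)
print(y.shape, (y - x @ w.T).abs().max())  # residual is quantization error
```

The sketch only reproduces the numerics; the point of extracting the TensorRT-LLM kernels is that a fused implementation keeps the packed int4 weights in shared memory and dequantizes inline during the GEMM, avoiding the full-precision weight materialization shown above.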
☆16 Updated last year
Alternatives and similar repositories for TLLM_QMM
Users interested in TLLM_QMM are comparing it to the libraries listed below.
- Compare different hardware platforms via the Roofline Model for LLM inference tasks (see the roofline sketch after this list). ☆115 Updated last year
- TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models. ☆97 Updated 2 years ago
- ☆129 Updated 10 months ago
- PyTorch distributed training acceleration framework ☆53 Updated 2 months ago
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆113 Updated 5 months ago
- ☆150 Updated 9 months ago
- ☆507 Updated last month
- GLake: optimizing GPU memory management and IO transmission. ☆483 Updated 7 months ago
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver… ☆265 Updated 2 months ago
- Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training. ☆269 Updated 2 years ago
- ☆97 Updated 7 months ago
- A high-performance framework for training wide-and-deep recommender systems on heterogeneous cluster ☆159 Updated last year
- DeepRec Extension is an easy-to-use, stable and efficient large-scale distributed training system based on DeepRec. ☆11 Updated last year
- ☆58 Updated 5 years ago
- KV cache store for distributed LLM inference ☆346 Updated last month
- HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of… ☆173 Updated 3 weeks ago
- ☆139 Updated last year
- DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments. ☆63 Updated this week
- ☆23 Updated 9 months ago
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆41 Updated 7 months ago
- ☆205 Updated 5 months ago
- A lightweight design for computation-communication overlap. ☆181 Updated 2 weeks ago
- Optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052 ☆479 Updated last year
- High performance RDMA-based distributed feature collection component for training GNN model on EXTREMELY large graph ☆55 Updated 3 years ago
- Pipeline Parallelism Emulation and Visualization ☆68 Updated 4 months ago
- NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer ☆139 Updated last month
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆94 Updated last month
- ☆307 Updated 3 weeks ago
- ☆100 Updated last year
- Fast and memory-efficient exact attention ☆96 Updated this week
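For the roofline comparison entry at the top of the list, the core computation is comparing a kernel's arithmetic intensity against the machine balance of the target hardware. Here is a minimal sketch; the peak numbers are illustrative A100-class placeholders, not figures taken from the listed repo.

```python
# Minimal roofline check for a decode-phase GEMV in LLM inference.
# Hardware peaks below are illustrative placeholders.
peak_flops = 312e12  # e.g. A100 fp16 tensor-core peak, FLOP/s
peak_bw = 2.0e12     # e.g. A100 HBM bandwidth, bytes/s

def roofline_time(flops: float, bytes_moved: float) -> float:
    """Attainable runtime = max of compute-bound and memory-bound times."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

# Decode step: batch 1, [1, 4096] x [4096, 4096] fp16 GEMV.
m, k, n = 1, 4096, 4096
flops = 2 * m * k * n                      # one multiply-add = 2 FLOPs
bytes_moved = 2 * (m * k + k * n + m * n)  # fp16 reads + writes
ai = flops / bytes_moved                   # arithmetic intensity, FLOP/byte
balance = peak_flops / peak_bw             # machine balance, FLOP/byte
bound = "compute" if ai > balance else "memory"
print(f"AI = {ai:.2f} FLOP/B vs balance = {balance:.0f} -> {bound}-bound, "
      f"{roofline_time(flops, bytes_moved) * 1e6:.2f} us")
```

At batch size 1 the GEMV's intensity (about 1 FLOP/byte) sits far below the machine balance (about 156 here), which is why decode-phase inference is bandwidth-bound and why weight quantization of the kind TLLM_QMM exposes pays off: 4-bit weights cut the dominant bytes-moved term roughly fourfold.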