zhihu / TLLM_QMM
TLLM_QMM extracts the quantized-kernel implementations from NVIDIA's TensorRT-LLM, removes the NVInfer dependency, and exposes them as easy-to-use PyTorch modules. We modified the dequantization and weight preprocessing to align with popular quantization algorithms such as AWQ and GPTQ, and combined them with new FP8 quantization.
☆16 Updated 11 months ago
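The AWQ/GPTQ-style dequantization convention the description refers to (group-wise scales and zero points applied as `w = (q - zero) * scale`) can be sketched in plain PyTorch. The function name and tensor shapes below are illustrative assumptions for exposition, not TLLM_QMM's actual API:

```python
import torch

def dequantize_groupwise(qweight: torch.Tensor,
                         scales: torch.Tensor,
                         zeros: torch.Tensor,
                         group_size: int = 128) -> torch.Tensor:
    """Group-wise dequantization in the AWQ/GPTQ style:
    w = (q - zero) * scale, with one (scale, zero) pair shared by
    each group of `group_size` input channels.

    Assumed (hypothetical) shapes:
      qweight: (in_features, out_features), integer codes (e.g. 0..15 for INT4)
      scales, zeros: (in_features // group_size, out_features)
    """
    in_features = qweight.shape[0]
    assert in_features % group_size == 0, "in_features must divide into groups"
    # Broadcast each group's scale/zero across its group_size input channels.
    s = scales.repeat_interleave(group_size, dim=0)
    z = zeros.repeat_interleave(group_size, dim=0)
    return (qweight.float() - z) * s
```

In practice, fast kernels fuse this dequantization into the GEMM itself rather than materializing the full-precision weight, which is why the weight-preprocessing (repacking) step the description mentions matters: the kernel expects a specific interleaved layout.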
Alternatives and similar repositories for TLLM_QMM
Users interested in TLLM_QMM are comparing it to the libraries listed below.
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆100 Updated last year
- ☆127 Updated 5 months ago
- ☆37 Updated 5 months ago
- ☆21 Updated last year
- TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models. ☆94 Updated 2 years ago
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆81 Updated 2 weeks ago
- PyTorch distributed training acceleration framework ☆49 Updated 3 months ago
- ☆58 Updated 4 years ago
- ☆148 Updated 4 months ago
- ☆332 Updated 4 months ago
- Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training. ☆267 Updated 2 years ago
- A high-performance framework for training wide-and-deep recommender systems on heterogeneous clusters ☆158 Updated last year
- Fast and memory-efficient exact attention ☆72 Updated last month
- HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of… ☆146 Updated last week
- [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral… ☆55 Updated 10 months ago
- ☆139 Updated last year
- Transformer-related optimization, including BERT, GPT ☆59 Updated last year
- DeepSeek-V3/R1 inference performance simulator ☆134 Updated 2 months ago
- ☆79 Updated last year
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver… ☆242 Updated 2 weeks ago
- ☆53 Updated last year
- ☆127 Updated 3 years ago
- KV cache store for distributed LLM inference ☆254 Updated last week
- ☆194 Updated last month
- DeepRec Extension is an easy-to-use, stable and efficient large-scale distributed training system based on DeepRec. ☆11 Updated last year
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud. ☆116 Updated last year
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆37 Updated 3 months ago
- DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including … ☆254 Updated last week
- GLake: optimizing GPU memory management and IO transmission. ☆463 Updated 2 months ago
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆124 Updated last month