zhihu / TLLM_QMM
TLLM_QMM strips the quantized-kernel implementations out of NVIDIA's TensorRT-LLM, removing the NVInfer dependency, and exposes them as an easy-to-use PyTorch module. We modified the dequantization and weight preprocessing to align with popular quantization algorithms such as AWQ and GPTQ, and combined them with the new FP8 quantization.
☆16 Updated last year
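The dequantization mentioned above refers to the group-wise W4A16 scheme used by AWQ and GPTQ: 4-bit weights are stored with per-group scales and zero points and expanded to fp16 during the GEMM. The sketch below shows the arithmetic only; the class and argument names are hypothetical and are not the TLLM_QMM API, and a real kernel fuses the dequantization into the matrix multiply instead of materializing the fp16 weight.

```python
import torch


class QuantLinearSketch(torch.nn.Module):
    """Illustrative group-wise W4A16 linear layer (not the TLLM_QMM interface).

    qweight: [out_features, in_features] int8 tensor holding unpacked 4-bit values (0..15)
    scales, zeros: [out_features, in_features // group_size] fp16 tensors
    """

    def __init__(self, qweight, scales, zeros, group_size=128):
        super().__init__()
        self.register_buffer("qweight", qweight)
        self.register_buffer("scales", scales)
        self.register_buffer("zeros", zeros)
        self.group_size = group_size

    def dequantize(self):
        # w = (q - zero) * scale, applied per group along the input dimension.
        scales = self.scales.repeat_interleave(self.group_size, dim=1)
        zeros = self.zeros.repeat_interleave(self.group_size, dim=1)
        return (self.qweight.half() - zeros) * scales

    def forward(self, x):
        # y = x @ W^T using the dequantized fp16 weight.
        return torch.nn.functional.linear(x, self.dequantize())
```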
Alternatives and similar repositories for TLLM_QMM
Users interested in TLLM_QMM are comparing it to the libraries listed below.
- PyTorch distributed training acceleration framework ☆55 Updated 6 months ago
- TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models. ☆99 Updated 2 years ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks (see the roofline sketch after this list). ☆120 Updated last year
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆123 Updated last month
- DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments. ☆92 Updated 3 weeks ago
- Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training. ☆271 Updated 2 years ago
- GLake: optimizing GPU memory management and IO transmission. ☆497 Updated 10 months ago
- A high-performance framework for training wide-and-deep recommender systems on a heterogeneous cluster ☆161 Updated last year
- HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of… ☆192 Updated this week
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver… ☆298 Updated this week
- KV cache store for distributed LLM inference ☆390 Updated 3 months ago
- DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including … ☆274 Updated 6 months ago
- Optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052 ☆477 Updated last year
- LLM training technologies developed by kwai ☆70 Updated 3 weeks ago
- SGLang kernel library for NPU ☆96 Updated last week
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆44 Updated 11 months ago
- Fast and memory-efficient exact attention ☆114 Updated this week
- High Performance LLM Inference Operator Library ☆695 Updated last week
- Transformer related optimization, including BERT, GPT ☆59 Updated 2 years ago
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆96 Updated 5 months ago
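As referenced in the Roofline Model item above, the roofline bound says attainable throughput is min(peak compute, memory bandwidth × arithmetic intensity). Below is a minimal, illustrative sketch for a single batch-1 decode GEMV; the peak numbers are assumed A100-class figures, not measurements from any of the listed projects.

```python
def roofline_time(flops, bytes_moved, peak_tflops, peak_bw_gbs):
    """Estimate kernel time as FLOPs divided by the roofline-attainable FLOP/s."""
    intensity = flops / bytes_moved                      # FLOPs per byte
    attainable = min(peak_tflops * 1e12,
                     peak_bw_gbs * 1e9 * intensity)      # FLOP/s
    return flops / attainable                            # seconds

# Example: one fp16 GEMV against a 4096x4096 weight (batch size 1 decode).
n = 4096
flops = 2 * n * n              # each multiply-accumulate counts as 2 FLOPs
bytes_moved = 2 * n * n        # fp16 weight traffic dominates at batch 1
t = roofline_time(flops, bytes_moved, peak_tflops=312, peak_bw_gbs=2039)
print(f"intensity ~{flops / bytes_moved:.0f} FLOP/B, est. {t * 1e6:.1f} us")
```

At this low arithmetic intensity the GEMV sits far below the compute roof, which is why batch-1 decode is bandwidth-bound and why weight-only quantization of the kind TLLM_QMM exposes reduces the bytes moved and hence decode latency.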