zhihu / TLLM_QMM
TLLM_QMM strips the quantized-kernel implementations out of NVIDIA's TensorRT-LLM, removing the NVInfer dependency and exposing an easy-to-use PyTorch module. We modified the dequantization and weight preprocessing to align with popular quantization algorithms such as AWQ and GPTQ, and combined them with the new FP8 quantization.
☆16 · Updated 7 months ago
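The page carries no usage snippet, so here is a minimal sketch of the group-wise INT4 dequantization that AWQ/GPTQ-style kernels perform. All names and the unpacked layout are illustrative assumptions, not TLLM_QMM's actual API; real kernels keep weights bit-packed and fuse this step into the GEMM.

```python
# Reference (unfused) group-wise W4 dequantization, as used by AWQ/GPTQ-style
# quantized linear layers. Illustrative only -- not TLLM_QMM's interface.
import torch

def dequantize_w4_groupwise(qweight, scales, zeros, group_size=128):
    """Recover FP16 weights from 4-bit values.

    qweight: (out_features, in_features) ints in [0, 15] (unpacked here for
             clarity; production kernels pack 8 such values per int32).
    scales, zeros: (out_features, in_features // group_size) FP16 per-group
             quantization parameters.
    """
    out_f, in_f = qweight.shape
    q = qweight.to(torch.float16).view(out_f, in_f // group_size, group_size)
    # w = (q - zero_point) * scale, applied per group along the input dim
    w = (q - zeros.unsqueeze(-1)) * scales.unsqueeze(-1)
    return w.view(out_f, in_f)

# A quantized linear layer then computes, in effect, y = x @ w.T -- except a
# fused kernel dequantizes tiles inside the GEMM instead of materializing w.
```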
Alternatives and similar repositories for TLLM_QMM:
Users interested in TLLM_QMM are comparing it to the libraries listed below.
- PyTorch distributed training acceleration framework ☆42 · Updated last week
- Compare different hardware platforms via the Roofline Model for LLM inference tasks (a worked example follows this list). ☆93 · Updated 11 months ago
- ☆36 · Updated 2 months ago
- TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models. ☆91 · Updated last year
- ☆58 · Updated 4 years ago
- ☆127 · Updated last month
- Elastic serverless serving based on Kubernetes; provides scale-to-zero (0-instance) serving capability. ☆10 · Updated 3 years ago
- Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training. ☆266 · Updated last year
- ☆83 · Updated 3 months ago
- ☆174 · Updated 4 months ago
- HierarchicalKV is part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of… ☆138 · Updated this week
- ☆314 · Updated last month
- [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral… ☆52 · Updated 6 months ago
- Artifact of the OSDI '24 paper, "Llumnix: Dynamic Scheduling for Large Language Model Serving". ☆60 · Updated 8 months ago
- A distributed KV store for disaggregated LLM inference. ☆31 · Updated this week
- ☆140 · Updated 10 months ago
- GLake: optimizing GPU memory management and IO transmission. ☆431 · Updated 2 months ago
- ☆104 · Updated 7 months ago
- A fast communication-overlapping library for tensor parallelism on GPUs. ☆297 · Updated 3 months ago
- AI fundamentals: GPU architecture, CUDA programming, and large-model basics. ☆70 · Updated this week
- Efficient and easy multi-instance LLM serving. ☆298 · Updated this week
- Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios. ☆35 · Updated 5 months ago
- A high-performance framework for training wide-and-deep recommender systems on heterogeneous clusters. ☆157 · Updated 10 months ago
- ☆67 · Updated 2 months ago
- ☆142 · Updated last month
- High-performance RDMA-based distributed feature-collection component for training GNN models on EXTREMELY large graphs. ☆50 · Updated 2 years ago
- ☆81 · Updated 5 months ago
- ☆76 · Updated last year
- ☆21 · Updated last year
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆102 · Updated this week
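As referenced in the Roofline Model entry above: the model bounds attainable throughput by min(peak compute, memory bandwidth × arithmetic intensity). A toy calculation of the kind such a tool performs; the hardware numbers are A100-like but purely illustrative assumptions.

```python
# Roofline estimate for a single kernel. Peak FLOP/s and bandwidth below are
# illustrative (roughly A100-class FP16), not measured values.
def roofline_time_s(flops, bytes_moved, peak_flops, peak_bw):
    """Attainable perf = min(peak_flops, peak_bw * arithmetic_intensity)."""
    intensity = flops / bytes_moved                    # FLOPs per byte
    attainable = min(peak_flops, peak_bw * intensity)  # FLOP/s achievable
    return flops / attainable                          # seconds

# Example: a batch-1 decode GEMV with K = N = 4096 and FP16 weights.
flops = 2 * 4096 * 4096            # one multiply + one add per weight
bytes_moved = 4096 * 4096 * 2      # weight bytes dominate at batch 1
t = roofline_time_s(flops, bytes_moved, peak_flops=312e12, peak_bw=2.0e12)
print(f"{t * 1e6:.1f} us")         # intensity = 1 FLOP/byte -> bandwidth-bound, ~16.8 us
```

At 1 FLOP/byte the kernel sits far left of the compute roof, which is why weight quantization (fewer bytes moved) speeds up LLM decoding even when the math stays in FP16.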