zhihu / TLLM_QMM
TLLM_QMM extracts the quantized kernel implementations from NVIDIA's TensorRT-LLM, removes the NVInfer dependency, and exposes an easy-to-use PyTorch module. We modified the dequantization and weight preprocessing to align with popular quantization algorithms such as AWQ and GPTQ, and combined them with new FP8 quantization.
☆12 Updated 4 months ago
Related projects
Alternatives and complementary repositories for TLLM_QMM
- Elastic Serverless Serving based on Kubernetes, providing 0-instance serving capability. ☆10 Updated 2 years ago
- A Go driver for Hive ☆55 Updated 3 months ago
- HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of… ☆133 Updated 2 months ago
- TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models. ☆90 Updated last year
- A high-performance framework for training wide-and-deep recommender systems on heterogeneous clusters ☆156 Updated 6 months ago
- distributed kv store ☆39 Updated 2 years ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆74 Updated 7 months ago
- Elastic Deep Learning Training based on Kubernetes by Leveraging EDL and Volcano ☆31 Updated last year
- ☆33 Updated 2 months ago
- A high-performance serving system for DeepRec based on TensorFlow Serving. ☆18 Updated 11 months ago
- A flexible, high-performance serving system for machine learning models ☆140 Updated 2 years ago
- ☆51 Updated last year
- ☆122 Updated 3 years ago
- PyTorch distributed training acceleration framework ☆32 Updated this week
- ☆282 Updated last week
- A Kubernetes-native data delivery platform ☆45 Updated last year
- Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training. ☆264 Updated last year
- DeepRec Extension is an easy-to-use, stable and efficient large-scale distributed training system based on DeepRec. ☆10 Updated 5 months ago
- Fault tolerance for DL frameworks ☆69 Updated last year
- ☆208 Updated last year
- ☆36 Updated 3 years ago
- Paper Reading: covering distributed systems, virtualization, networking, and machine learning ☆22 Updated 4 years ago
- ☆11 Updated 4 years ago
- The core library and APIs implementing the Triton Inference Server. ☆104 Updated this week
- High-performance RDMA-based distributed feature collection component for training GNN models on EXTREMELY large graphs ☆47 Updated 2 years ago
- ☆100 Updated 7 months ago
- ☆123 Updated this week
- A high-performance distributed execution engine ☆70 Updated 3 weeks ago
- A lightweight parameter server interface ☆73 Updated last year
- A library developed by Volcano Engine for high-performance reading and writing of PyTorch model files. ☆13 Updated 5 months ago