zhihu / TLLM_QMM
TLLM_QMM strips the quantized-kernel implementations out of NVIDIA's TensorRT-LLM, removes the NVInfer dependency, and exposes them as an easy-to-use PyTorch module. We modified the dequantization and weight preprocessing to align with popular quantization algorithms such as AWQ and GPTQ, and combined them with the new FP8 quantization.
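To make the quantization scheme concrete, here is a minimal pure-PyTorch sketch of a group-wise W4A16 (4-bit weight, 16-bit activation) GEMM of the kind AWQ and GPTQ target. This is a reference implementation under assumed conventions: the function names, the asymmetric zero-point scheme, and the group size of 128 are illustrative choices, not TLLM_QMM's actual API, which runs fused CUDA kernels rather than dequantize-then-matmul.

```python
# Illustrative sketch of group-wise W4A16 numerics; NOT TLLM_QMM's API.
import torch

def quantize_w4_groupwise(w: torch.Tensor, group_size: int = 128):
    """Asymmetric 4-bit group-wise quantization along the input dim.
    w: [out_features, in_features]; returns uint8 codes in [0, 15]
    plus per-group scales and zero-points."""
    out_f, in_f = w.shape
    assert in_f % group_size == 0
    wg = w.float().reshape(out_f, in_f // group_size, group_size)
    w_min = wg.amin(dim=-1, keepdim=True)
    w_max = wg.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0  # 16 levels for 4 bits
    zero = (-w_min / scale).round().clamp(0, 15)
    q = (wg / scale + zero).round().clamp(0, 15).to(torch.uint8)
    return q.reshape(out_f, in_f), scale.squeeze(-1), zero.squeeze(-1)

def w4a16_matmul(x, q, scale, zero, group_size: int = 128):
    """y = x @ W^T, with W reconstructed from its int4 codes.
    A fused kernel would instead dequantize inline inside the GEMM."""
    out_f, in_f = q.shape
    qg = q.reshape(out_f, in_f // group_size, group_size).float()
    w = (qg - zero.unsqueeze(-1)) * scale.unsqueeze(-1)
    return x @ w.reshape(out_f, in_f).to(x.dtype).T

# fp32 demo tensors for CPU portability; production uses fp16 activations.
x = torch.randn(4, 512)
w = torch.randn(1024, 512)
q, s, z = quantize_w4_groupwise(w)
y = w4a16_matmul(x, q, s, z)
print(y.shape, (y - x @ w.T).abs().max())  # residual is quantization error
```

The sketch only reproduces the numerics; the point of extracting the TensorRT-LLM kernels is that a fused implementation keeps the packed int4 weights in shared memory and dequantizes inline during the GEMM, avoiding the full-precision weight materialization shown above.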
☆16 Updated last year
Alternatives and similar repositories for TLLM_QMM
Users interested in TLLM_QMM are comparing it to the libraries listed below.
- Compare different hardware platforms via the Roofline Model for LLM inference tasks (see the roofline sketch after this list). ☆115 Updated last year
- TePDist (TEnsor Program DISTributed) is an HLO-level automatic distributed system for DL models. ☆97 Updated 2 years ago
- ☆129 Updated 10 months ago
- PyTorch distributed training acceleration framework ☆53 Updated 2 months ago
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆113 Updated 5 months ago
- ☆150 Updated 9 months ago
- ☆507 Updated last month
- GLake: optimizing GPU memory management and IO transmission. ☆483 Updated 7 months ago
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver… ☆265 Updated 2 months ago
- Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training. ☆269 Updated 2 years ago
- ☆97 Updated 7 months ago
- A high-performance framework for training wide-and-deep recommender systems on heterogeneous cluster ☆159 Updated last year
- DeepRec Extension is an easy-to-use, stable and efficient large-scale distributed training system based on DeepRec. ☆11 Updated last year
- ☆58 Updated 5 years ago
- KV cache store for distributed LLM inference ☆346 Updated last month
- HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of… ☆173 Updated 3 weeks ago
- ☆139 Updated last year
- DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments. ☆63 Updated this week
- ☆23 Updated 9 months ago
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆41 Updated 7 months ago
- ☆205 Updated 5 months ago
- A lightweight design for computation-communication overlap. ☆181 Updated 2 weeks ago
- Optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052 ☆479 Updated last year
- High performance RDMA-based distributed feature collection component for training GNN model on EXTREMELY large graph ☆55 Updated 3 years ago
- Pipeline Parallelism Emulation and Visualization ☆68 Updated 4 months ago
- NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer ☆139 Updated last month
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆94 Updated last month
- ☆307 Updated 3 weeks ago
- ☆100 Updated last year
- Fast and memory-efficient exact attention ☆96 Updated this week
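For the roofline comparison entry at the top of the list, the core computation is comparing a kernel's arithmetic intensity against the machine balance of the target hardware. Here is a minimal sketch; the peak numbers are illustrative A100-class placeholders, not figures taken from the listed repo.

```python
# Minimal roofline check for a decode-phase GEMV in LLM inference.
# Hardware peaks below are illustrative placeholders.
peak_flops = 312e12  # e.g. A100 fp16 tensor-core peak, FLOP/s
peak_bw = 2.0e12     # e.g. A100 HBM bandwidth, bytes/s

def roofline_time(flops: float, bytes_moved: float) -> float:
    """Attainable runtime = max of compute-bound and memory-bound times."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

# Decode step: batch 1, [1, 4096] x [4096, 4096] fp16 GEMV.
m, k, n = 1, 4096, 4096
flops = 2 * m * k * n                      # one multiply-add = 2 FLOPs
bytes_moved = 2 * (m * k + k * n + m * n)  # fp16 reads + writes
ai = flops / bytes_moved                   # arithmetic intensity, FLOP/byte
balance = peak_flops / peak_bw             # machine balance, FLOP/byte
bound = "compute" if ai > balance else "memory"
print(f"AI = {ai:.2f} FLOP/B vs balance = {balance:.0f} -> {bound}-bound, "
      f"{roofline_time(flops, bytes_moved) * 1e6:.2f} us")
```

At batch size 1 the GEMV's intensity (about 1 FLOP/byte) sits far below the machine balance (about 156 here), which is why decode-phase inference is bandwidth-bound and why weight quantization of the kind TLLM_QMM exposes pays off: 4-bit weights cut the dominant bytes-moved term roughly fourfold.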