SqueezeBits / Torch-TRTLLM

Ditto is an open-source framework that enables direct conversion of HuggingFace PreTrainedModels into TensorRT-LLM engines.

☆49

Alternatives and similar repositories for Torch-TRTLLM

Users who are interested in Torch-TRTLLM are comparing it to the libraries listed below.

SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆118 Updated last year
NetEase-FuXi / EETQ
Easy and Efficient Quantization for Transformers
☆203 Updated 3 months ago
neuralmagic / nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆265 Updated last year
vedantroy / gpu_kernels
☆27 Updated last year
triton-inference-server / vllm_backend
☆300 Updated this week
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆83 Updated last year
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆390 Updated 3 months ago
ailzhang / minPP
Pipeline parallelism for the minimalist
☆35 Updated 2 months ago
DeepAuto-AI / hip-attention
Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton.
☆148 Updated this week
neuralmagic / AutoFP8
☆201 Updated 5 months ago
snowflakedb / ArcticInference
ArcticInference: vLLM plugin for high-throughput, low-latency inference
☆270 Updated last week
triton-inference-server / onnxruntime_backend
The Triton backend for the ONNX Runtime.
☆162 Updated this week
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆379 Updated 2 weeks ago
ModelTC / awesome-lm-system
Summary of system papers/frameworks/codes/tools on training or serving large models
☆57 Updated last year
neuralmagic / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆167 Updated this week
nbasyl / LLM-FP4
The official implementation of the EMNLP 2023 paper LLM-FP4
☆216 Updated last year
deepspeedai / DeepSpeed-Kernels
☆72 Updated 6 months ago
anyscale / llm-continuous-batching-benchmarks
☆121 Updated last year
Azure / MS-AMP-Examples
Examples for MS-AMP package.
☆30 Updated 2 months ago
kakaobrain / trident
A performance library for machine learning applications.
☆184 Updated 2 years ago
InternLM / turbomind
☆95 Updated 6 months ago
OpenBMB / CPM.cu
CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec…
☆197 Updated 3 weeks ago
tspeterkim / paged-attention-minimal
A minimal cache manager for PagedAttention, on top of llama3.
☆123 Updated last year
junstar92 / nvidia-libraries-study
☆54 Updated 10 months ago
usyd-fsalab / fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5).
☆265 Updated 2 months ago
facebookresearch / any4
Quantize transformers to any learned arbitrary 4-bit numeric format
☆48 Updated 3 months ago
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆320 Updated last year
ankan-ban / llama_cu_awq
llama INT4 cuda inference with AWQ
☆55 Updated 8 months ago
facebookresearch / ParetoQ
This repository contains the training code of ParetoQ, introduced in our work "ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization"
☆107 Updated 4 months ago
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆143 Updated last month