SqueezeBits / Torch-TRTLLM
Ditto is an open-source framework that enables direct conversion of HuggingFace PreTrainedModels into TensorRT-LLM engines.
☆49 · Updated 3 months ago
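For context, Ditto's pitch is that you hand it an ordinary `transformers` model object and get a TensorRT-LLM engine back, skipping the separate checkpoint-conversion step that the stock `trtllm-build` flow normally requires. Below is a minimal sketch of that workflow; the `ditto.build_trtllm_engine` call is a hypothetical placeholder (check the repo's README for the actual entry point), while the `transformers` calls are standard API.

```python
# Minimal sketch of the direct-conversion workflow described above.
# NOTE: `ditto.build_trtllm_engine` is a hypothetical placeholder, not
# Ditto's documented API; only the `transformers` calls below are standard.
import torch
from transformers import AutoModelForCausalLM

# Load an ordinary HuggingFace PreTrainedModel.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

# Hypothetical direct build: PyTorch module in, TensorRT-LLM engine out,
# with no intermediate TensorRT-LLM checkpoint on disk.
# engine = ditto.build_trtllm_engine(model, output_dir="./engine")
```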
Alternatives and similar repositories for Torch-TRTLLM
Users interested in Torch-TRTLLM are comparing it to the libraries listed below.
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆118 · Updated last year
- Easy and Efficient Quantization for Transformers ☆202 · Updated 4 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆266 · Updated last year
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆85 · Updated last year
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆183 · Updated this week
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆217 · Updated last year
- Efficient LLM Inference over Long Sequences ☆390 · Updated 4 months ago
- Summary of system papers/frameworks/codes/tools on training or serving large models ☆57 · Updated last year
- Fast low-bit matmul kernels in Triton ☆388 · Updated last week
- Training-free post-training efficient sub-quadratic complexity attention, implemented with OpenAI Triton ☆148 · Updated 2 weeks ago
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (the smoothing step is sketched after this list) ☆23 · Updated last year
- This repository contains the training code of ParetoQ, introduced in the work "ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization" ☆108 · Updated 2 weeks ago
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs ☆145 · Updated 2 months ago
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and easy export to onnx/onnx-runtime ☆180 · Updated 7 months ago
- A performance library for machine learning applications ☆184 · Updated 2 years ago
- Quantize transformers to any arbitrary learned 4-bit numeric format ☆48 · Updated 3 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆309 · Updated 5 months ago
- Code repo for the paper "SpinQuant: LLM quantization with learned rotations" ☆339 · Updated 8 months ago
- Official implementation for Training LLMs with MXFP4 ☆101 · Updated 6 months ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆323 · Updated last year
- A minimal cache manager for PagedAttention, on top of llama3 ☆125 · Updated last year
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components ☆216 · Updated this week
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆288 · Updated last week
- Step by step explanation/tutorial of llama2.c ☆224 · Updated 2 years ago
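Several of the quantization entries above (SmoothQuant, QQQ, Atom) revolve around the same obstacle: a few activation channels with large outliers make low-bit activation quantization lossy. As a concrete anchor, here is a minimal sketch of the per-channel smoothing transform at the heart of the SmoothQuant entry, which computes s_j = max|X_j|^α / max|W_j|^(1−α) and migrates quantization difficulty from activations to weights. The helper name and the offline-calibrated `x_absmax` input are illustrative, not the repo's actual API.

```python
import torch

def smooth_linear(x_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """Sketch of SmoothQuant's per-channel smoothing for one linear layer.

    x_absmax: per-input-channel activation abs-max, shape [in_features],
              collected offline on a small calibration set.
    weight:   linear layer weight, shape [out_features, in_features].
    Returns the smoothed weight and the scales to fold into the producer of X.
    """
    # Per-input-channel weight range, shape [in_features].
    w_absmax = weight.abs().amax(dim=0)

    # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha); alpha=0.5 balances the
    # difficulty between activations and weights.
    scales = (x_absmax.clamp(min=1e-5) ** alpha) / (w_absmax.clamp(min=1e-5) ** (1 - alpha))

    # W' = W * diag(s): scale each weight column by s_j.
    smoothed_weight = weight * scales
    return smoothed_weight, scales
```

Because `W' = W · diag(s)` while the matching `1/s` factor is folded into the op that produces `X` (typically the preceding LayerNorm), the network stays mathematically equivalent, but its activations become much easier to quantize to INT8.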