SqueezeBits / Torch-TRTLLM
Ditto is an open-source framework that enables direct conversion of HuggingFace PreTrainedModels into TensorRT-LLM engines.
☆44 · Updated this week
Alternatives and similar repositories for Torch-TRTLLM
Users interested in Torch-TRTLLM are comparing it to the libraries listed below:
- Easy and Efficient Quantization for Transformers ☆198 · Updated 2 weeks ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆117 · Updated last year
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆264 · Updated 9 months ago
- Boosting 4-bit inference kernels with 2:4 sparsity ☆80 · Updated 10 months ago
- ☆271 · Updated last month
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) ☆156 · Updated this week
- Efficient LLM Inference over Long Sequences ☆382 · Updated 2 weeks ago
- Training-free Post-training Efficient Sub-quadratic Complexity Attention, implemented with OpenAI Triton ☆139 · Updated this week
- Step-by-step explanation/tutorial of llama2.c ☆222 · Updated last year
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆135 · Updated this week
- ☆195 · Updated 2 months ago
- ☆26 · Updated last year
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆160 · Updated this week
- The Triton backend for the ONNX Runtime ☆155 · Updated this week
- llama3.cuda is a pure C/CUDA implementation of the Llama 3 model ☆335 · Updated 2 months ago
- vLLM adapter for a TGIS-compatible gRPC server ☆33 · Updated this week
- Make Triton easier ☆47 · Updated last year
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ☆11 · Updated last year
- Fast low-bit matmul kernels in Triton ☆330 · Updated this week
- ☆73 · Updated 3 months ago
- Model compression for ONNX ☆96 · Updated 7 months ago
- A collection of all available inference solutions for LLMs ☆91 · Updated 4 months ago
- A performance library for machine learning applications ☆184 · Updated last year
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ and easy export to ONNX/ONNX Runtime ☆174 · Updated 3 months ago
- FriendliAI Model Hub ☆91 · Updated 3 years ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆276 · Updated last month
- Summary of system papers/frameworks/code/tools for training or serving large models ☆57 · Updated last year
- ☆214 · Updated 5 months ago
- FMO (Friendli Model Optimizer) ☆12 · Updated 6 months ago
- ☆120 · Updated last year