SqueezeBits / Torch-TRTLLM
Ditto is an open-source framework that enables direct conversion of HuggingFace PreTrainedModels into TensorRT-LLM engines.
☆38 · Updated this week
Alternatives and similar repositories for Torch-TRTLLM:
Users interested in Torch-TRTLLM are also comparing it to the libraries listed below.
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆116 · Updated last year
- A performance library for machine learning applications. ☆184 · Updated last year
- Training-free Post-training Efficient Sub-quadratic Complexity Attention, implemented with OpenAI Triton. ☆127 · Updated this week
- FriendliAI Model Hub ☆92 · Updated 2 years ago
- Triton kernels for Flux ☆20 · Updated 3 months ago
- OSLO: Open Source for Large-scale Optimization ☆175 · Updated last year
- Inference code for LLaMA models ☆21 · Updated 3 weeks ago
- Easy and Efficient Quantization for Transformers ☆197 · Updated 2 months ago
- Boosting 4-bit inference kernels with 2:4 sparsity ☆72 · Updated 7 months ago
- Load compute kernels from the Hub ☆115 · Updated this week
- Performant kernels for symmetric tensors ☆13 · Updated 8 months ago
- OwLite is a low-code compression toolkit for AI models. ☆43 · Updated 2 months ago
- ☆52 · Updated 5 months ago
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆165 · Updated last month
- Make Triton easier ☆47 · Updated 10 months ago
- ☆101 · Updated last year
- Fast low-bit matmul kernels in Triton ☆291 · Updated this week
- A block-oriented training approach for inference-time optimization. ☆32 · Updated 8 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆116 · Updated this week
- ☆101 · Updated 10 months ago
- The triangle in action! Triton ☆16 · Updated last year
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ☆22 · Updated last year
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆40 · Updated last year
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆115 · Updated 4 months ago
- ☆68 · Updated last month
- ☆26 · Updated last year
- Efficient LLM Inference over Long Sequences ☆370 · Updated this week
- ring-attention experiments ☆130 · Updated 6 months ago
- ☆157 · Updated last year
- Applied AI experiments and examples for PyTorch ☆262 · Updated last month