SqueezeBits / Torch-TRTLLM
Ditto is an open-source framework that enables direct conversion of HuggingFace PreTrainedModels into TensorRT-LLM engines.
☆41 · Updated this week
Alternatives and similar repositories for Torch-TRTLLM
Users interested in Torch-TRTLLM are comparing it to the libraries listed below.
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆117 · Updated last year
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆73 · Updated 8 months ago
- Make triton easier ☆47 · Updated 11 months ago
- vLLM adapter for a TGIS-compatible gRPC server ☆29 · Updated this week
- A performance library for machine learning applications ☆184 · Updated last year
- ☆69 · Updated last month
- Easy and Efficient Quantization for Transformers ☆197 · Updated 3 months ago
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆41 · Updated last year
- OwLite is a low-code AI model compression toolkit for AI models ☆43 · Updated 2 months ago
- PyTorch/XLA integration with JetStream (https://github.com/google/JetStream) for LLM inference ☆60 · Updated last month
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ☆23 · Updated last year
- Training-free post-training efficient sub-quadratic complexity attention, implemented with OpenAI Triton ☆130 · Updated this week
- Model compression for ONNX ☆92 · Updated 6 months ago
- FriendliAI Model Hub ☆92 · Updated 2 years ago
- OSLO: Open Source for Large-scale Optimization ☆175 · Updated last year
- Inference code for LLaMA models ☆21 · Updated last month
- Fast low-bit matmul kernels in Triton ☆301 · Updated this week
- Load compute kernels from the Hub ☆119 · Updated last week
- Triton kernels for Flux ☆20 · Updated 4 months ago
- Best practices for testing advanced Mixtral, DeepSeek, and Qwen series MoE models using Megatron Core MoE ☆10 · Updated 3 weeks ago
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆173 · Updated last week
- ☆106 · Updated 11 months ago
- Extensible collectives library in Triton ☆86 · Updated last month
- ☆46 · Updated 8 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance ☆124 · Updated this week
- Repository for CPU kernel generation for LLM inference ☆26 · Updated last year
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆112 · Updated this week
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆308 · Updated 10 months ago
- QQQ is a hardware-optimized W4A8 quantization solution for LLMs ☆121 · Updated last month
- Applied AI experiments and examples for PyTorch ☆267 · Updated this week