SqueezeBits / Torch-TRTLLM
Ditto is an open-source framework that enables direct conversion of HuggingFace PreTrainedModels into TensorRT-LLM engines.
☆41 · Updated this week
Alternatives and similar repositories for Torch-TRTLLM
Users interested in Torch-TRTLLM are comparing it to the libraries listed below.
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆118 · Updated last year
- Boosting 4-bit inference kernels with 2:4 sparsity ☆76 · Updated 9 months ago
- OwLite is a low-code compression toolkit for AI models. ☆45 · Updated 3 weeks ago
- Training-free, post-training, efficient sub-quadratic-complexity attention, implemented with OpenAI Triton. ☆133 · Updated this week
- FriendliAI Model Hub ☆91 · Updated 2 years ago
- A performance library for machine learning applications. ☆184 · Updated last year
- Triton kernels for Flux ☆20 · Updated 5 months ago
- ☆46 · Updated 9 months ago
- vLLM adapter for a TGIS-compatible gRPC server. ☆30 · Updated this week
- A fork of SGLang for hip-attention integration; please refer to hip-attention for details. ☆13 · Updated this week
- [ICLR 2025] Breaking the Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆116 · Updated 6 months ago
- FMO (Friendli Model Optimizer) ☆12 · Updated 5 months ago
- ☆100 · Updated last year
- Code for the AAAI 2024 Oral paper "OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Model… ☆62 · Updated last year
- OSLO: Open Source for Large-scale Optimization ☆174 · Updated last year
- ☆71 · Updated 2 months ago
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ☆23 · Updated last year
- Easy and Efficient Quantization for Transformers ☆199 · Updated 4 months ago
- Fast low-bit matmul kernels in Triton ☆311 · Updated this week
- Faster PyTorch bitsandbytes 4-bit fp4 nn.Linear ops ☆28 · Updated last year
- ☆26 · Updated last year
- Sparse fine-tuning of LLMs via a modified version of the MosaicML llmfoundry ☆42 · Updated last year
- A safetensors extension for efficiently storing sparse quantized tensors on disk ☆117 · Updated last week
- An extensible collectives library in Triton ☆87 · Updated 2 months ago
- Inference code for LLaMA models ☆21 · Updated 2 months ago
- QQQ is a hardware-optimized W4A8 quantization solution for LLMs. ☆124 · Updated 2 months ago
- ☆138 · Updated this week
- Model compression for ONNX ☆96 · Updated 6 months ago
- Triangle in action! Triton (삼각형의 실전! Triton) ☆16 · Updated last year
- This repository contains the experimental PyTorch-native float8 training UX ☆223 · Updated 10 months ago