SqueezeBits / Torch-TRTLLM
Ditto is an open-source framework that enables direct conversion of HuggingFace PreTrainedModels into TensorRT-LLM engines.
Alternatives and similar repositories for Torch-TRTLLM:
Users interested in Torch-TRTLLM also compare it to the libraries listed below.
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
- OwLite is a low-code compression toolkit for AI models.
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
- Boosting 4-bit inference kernels with 2:4 Sparsity
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
- Training-free, post-training, sub-quadratic-complexity attention, implemented with OpenAI Triton.
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
- Easy and Efficient Quantization for Transformers
- Simple implementation of Speculative Sampling in NumPy for GPT-2.
- Code for the AAAI 2024 Oral paper "OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Model…"
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs
- Summary of system papers/frameworks/code/tools for training or serving large models
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
- The official implementation of the EMNLP 2023 paper LLM-FP4
- A safetensors extension to efficiently store sparse quantized tensors on disk
- Model compression for ONNX
- vLLM adapter for a TGIS-compatible gRPC server.
- Load compute kernels from the Hub
- Triton in practice! (original Korean title: 삼각형의 실전! Triton)
- Work in progress.
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
- Fast low-bit matmul kernels in Triton
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry
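
Two of the entries above (the NumPy speculative-sampling implementation for GPT-2 and the ICLR 2025 speculative-decoding work) center on speculative sampling, where a small draft model proposes tokens and the target model verifies them. The sketch below is a minimal NumPy illustration of the standard accept/reject step only; the function name and toy distributions are invented for this example and are not taken from any of the listed repositories.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(draft_probs, target_probs, proposed):
    """One speculative-sampling verification pass (illustrative sketch).

    draft_probs, target_probs: (k, vocab) per-position distributions.
    proposed: k token ids sampled from the draft model.
    Returns the accepted prefix, with one resampled token on rejection.
    """
    accepted = []
    for i, tok in enumerate(proposed):
        p, q = target_probs[i, tok], draft_probs[i, tok]
        if rng.random() < min(1.0, p / q):  # accept with prob min(1, p/q)
            accepted.append(int(tok))
        else:
            # on rejection, resample from the residual max(0, p - q), renormalized
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted
    return accepted
```

When draft and target distributions agree, every proposed token is accepted (the acceptance ratio is exactly 1), which is what makes the scheme lossless: the output distribution matches sampling from the target model alone.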