CactusQ / TensorRT-LLM-Tutorial
Getting started with TensorRT-LLM using BLOOM as a case study
☆13 · Updated 10 months ago
Alternatives and similar repositories for TensorRT-LLM-Tutorial:
Users interested in TensorRT-LLM-Tutorial are comparing it to the libraries listed below.
- A collection of all available inference solutions for LLMs ☆74 · Updated 4 months ago
- ☆216 · Updated last week
- A general 2–8 bit quantization toolbox with GPTQ/AWQ/HQQ, and easy export to ONNX/ONNX Runtime. ☆153 · Updated this week
- Unofficial implementation of https://arxiv.org/pdf/2407.14679 ☆41 · Updated 4 months ago
- Utils for Unsloth ☆27 · Updated this week
- A repository dedicated to evaluating the performance of quantized LLaMA3 using various quantization methods. ☆175 · Updated last week
- LLM KV cache compression made easy ☆305 · Updated last week
- Easy and Efficient Quantization for Transformers ☆191 · Updated last month
- ☢️ TensorRT Hackathon 2023 finals entry: inference acceleration optimization for the Llama model based on TensorRT-LLM ☆45 · Updated last year
- Efficient LLM Inference over Long Sequences ☆347 · Updated 3 weeks ago
- This reference can be used with any existing OpenAI-integrated apps to run with TRT-LLM inference locally on a GeForce GPU on Windows inste… ☆117 · Updated 10 months ago
- A family of compressed models obtained via pruning and knowledge distillation ☆309 · Updated 2 months ago
- 🕹️ Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models ☆138 · Updated 5 months ago
- OpenAI-compatible API for the TensorRT-LLM Triton backend ☆186 · Updated 5 months ago
- ☆168 · Updated 3 months ago
- Distributed training (multi-node) of a Transformer model ☆49 · Updated 9 months ago
- vLLM Router ☆17 · Updated 10 months ago
- Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation ☆33 · Updated 10 months ago
- ☆52 · Updated 7 months ago
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs ☆183 · Updated last month
- ☆55 · Updated last month
- The Triton backend for TensorRT ☆68 · Updated last week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆257 · Updated 3 months ago
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ☆259 · Updated last week
- Parameter-efficient finetuning script for Phi-3-vision, the strong multimodal language model by Microsoft ☆56 · Updated 7 months ago
- Mixed-precision training from scratch with Tensors and CUDA ☆21 · Updated 8 months ago
- Breaking the Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆107 · Updated last month
- ☆95 · Updated 2 weeks ago
- Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens" ☆125 · Updated 6 months ago
- Nexusflow function call, tool use, and agent benchmarks ☆18 · Updated last month