CactusQ / TensorRT-LLM-Tutorial

Getting started with TensorRT-LLM using BLOOM as a case study

☆18

Alternatives and similar repositories for TensorRT-LLM-Tutorial:

Users that are interested in TensorRT-LLM-Tutorial are comparing it to the libraries listed below

triton-inference-server / vllm_backend
☆253Updated this week
pphuc25 / distil-cd
Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation
☆35Updated last year
wejoncy / QLLM
A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily.
☆168Updated last month
NVIDIA / trt-llm-as-openai-windows
This reference can be used with any existing OpenAI integrated apps to run with TRT-LLM inference locally on GeForce GPU on Windows inste…
☆120Updated last year
alperiox / Compact-Language-Models-via-Pruning-and-Knowledge-Distillation
Unofficial implementation of https://arxiv.org/pdf/2407.14679
☆44Updated 8 months ago
Macaronlin / LLaMA3-Quantization
A repository dedicated to evaluating the performance of quantizied LLaMA3 using various quantization methods..
☆182Updated 3 months ago
seongminp / hyperseg
Code for HyperSeg and HyperSum
☆12Updated 11 months ago
facebookresearch / LayerSkip
Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024
☆292Updated last week
LLM-inference-router / vllm-router
vLLM Router
☆28Updated last year
npuichigo / openai_trtllm
OpenAI compatible API for TensorRT LLM triton backend
☆205Updated 9 months ago
huggingface / optimum-benchmark
🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O…
☆299Updated this week
hkproj / quantization-notes
Notes on quantization in neural networks
☆82Updated last year
NetEase-FuXi / EETQ
Easy and Efficient Quantization for Transformers
☆197Updated 3 months ago
AI-Maker-Space / FastAPI-LLM-Model-Serving
How to quickly serve an LLM using Fast API, Celery, and Redis
☆15Updated last year
nickaggarwal / nvidia-triton-llm-streaming
Integrating SSE with NVIDIA Triton Inference Server using a Python backend and Zephyr model. There is very less documentation how to use …
☆10Updated 11 months ago
NVlabs / Minitron
A family of compressed models obtained via pruning and knowledge distillation
☆336Updated 5 months ago
Entropy-xcy / bitnet158
☆69Updated last year
nyunAI / PruneGPT
☆53Updated 11 months ago
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆373Updated last week
dusty-nv / NanoLLM
Optimized local inference for LLMs with HuggingFace-like APIs for quantization, vision/language models, multimodal agents, speech, vector…
☆262Updated 6 months ago
neuralmagic / guidellm
Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs
☆291Updated this week
facebookresearch / LLM-QAT
Code repo for the paper "LLM-QAT Data-Free Quantization Aware Training for Large Language Models"
☆284Updated 2 months ago
triton-inference-server / perf_analyzer
☆69Updated this week
mani-kantap / llm-inference-solutions
A collection of all available inference solutions for the LLMs
☆87Updated 2 months ago
huggingface / optimum-tpu
Google TPU optimizations for transformers models
☆109Updated 3 months ago
UbiquitousLearning / SLM_Survey
☆91Updated 7 months ago
neuralmagic / nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆263Updated 7 months ago
samchaineau / llm_slerp_generation
Repo hosting codes and materials related to speeding LLMs' inference using token merging.
☆36Updated last year
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆308Updated 10 months ago
ducnt18121997 / Viet-Text-Normalization
A Python library for text normalization, specifically designed for Vietnamese and English text processing. This library provides comprehe…
☆11Updated last month