npuichigo / openai_trtllm
OpenAI-compatible API for the TensorRT-LLM Triton backend
☆220 · Aug 1, 2024 · Updated last year
Alternatives and similar repositories for openai_trtllm
Users interested in openai_trtllm are comparing it to the libraries listed below.
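Since openai_trtllm exposes an OpenAI-compatible API, clients talk to it with standard chat-completion requests. A minimal sketch of building such a request body follows; the base URL and model name (`ensemble`) are hypothetical assumptions for illustration, not taken from the repository.

```python
import json

# Hypothetical local endpoint for an openai_trtllm instance fronting Triton;
# host, port, and model name are assumptions, not confirmed by the repo.
BASE_URL = "http://localhost:3000/v1"

def chat_completion_request(model: str, prompt: str) -> dict:
    """Build the JSON body for a POST {BASE_URL}/chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

body = chat_completion_request("ensemble", "Hello!")
print(json.dumps(body))
```

Because the server mimics the OpenAI wire format, any OpenAI-compatible client (e.g. the official `openai` Python SDK pointed at a custom `base_url`) should be able to send this payload unchanged.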
- The Triton TensorRT-LLM Backend ☆918 · Updated this week
- High-level API for tar-based dataset ☆12 · Feb 3, 2024 · Updated 2 years ago
- ☆329 · Feb 9, 2026 · Updated last week
- ☆28 · Nov 6, 2024 · Updated last year
- TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizat… ☆12,867 · Updated this week
- TensorRT-LLM server with Structured Outputs (JSON) built with Rust ☆67 · Apr 25, 2025 · Updated 9 months ago
- JAX bindings for the flash-attention3 kernels ☆20 · Jan 2, 2026 · Updated last month
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆267 · Dec 4, 2025 · Updated 2 months ago
- The driver for LMCache core to run in vLLM ☆60 · Feb 4, 2025 · Updated last year
- A unified library of SOTA model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresse… ☆1,964 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆17 · Jun 3, 2024 · Updated last year
- FlashInfer: Kernel Library for LLM Serving ☆4,935 · Updated this week
- Triton CLI is an open source command line interface that enables users to create, deploy, and profile models served by the Triton Inferen… ☆73 · Feb 9, 2026 · Updated last week
- ☆281 · Feb 4, 2026 · Updated last week
- LMDeploy is a toolkit for compressing, deploying, and serving LLMs. ☆7,606 · Updated this week
- Scripts for BGE inference optimization ☆29 · Jan 23, 2024 · Updated 2 years ago
- RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs ☆19 · Feb 8, 2026 · Updated last week
- This repository contains tutorials and examples for Triton Inference Server ☆822 · Feb 9, 2026 · Updated last week
- LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalabili… ☆3,888 · Feb 9, 2026 · Updated last week
- An easy-to-use package for implementing SmoothQuant for LLMs ☆110 · Apr 7, 2025 · Updated 10 months ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆2,737 · Updated this week
- MERA (Multimodal Evaluation for Russian-language Architectures) is a new open benchmark for the Russian language for evaluating SOTA mode… ☆39 · Feb 3, 2026 · Updated last week
- The Triton Inference Server provides an optimized cloud and edge inferencing solution. ☆10,361 · Updated this week
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ☆2,093 · Jun 30, 2025 · Updated 7 months ago
- Proxy server for the Triton gRPC server that runs inference on an embedding model, in Rust ☆21 · Aug 10, 2024 · Updated last year
- A throughput-oriented high-performance serving framework for LLMs ☆945 · Oct 29, 2025 · Updated 3 months ago
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLMs' inference, uses approximate and dynamic sparse calculation of the attention… ☆1,183 · Sep 30, 2025 · Updated 4 months ago
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆1,011 · Sep 4, 2024 · Updated last year
- ☆21 · Feb 27, 2024 · Updated last year
- Official implementation of Half-Quadratic Quantization (HQQ) ☆913 · Dec 18, 2025 · Updated last month
- Triton Model Navigator is an inference toolkit designed for optimizing and deploying Deep Learning models with a focus on NVIDIA GPUs. ☆218 · Feb 3, 2026 · Updated last week
- Deployment of a light and full OpenAI API for production with vLLM, supporting /v1/embeddings with all embedding models. ☆44 · Jul 16, 2024 · Updated last year
- LLMPerf is a library for validating and benchmarking LLMs ☆1,084 · Dec 9, 2024 · Updated last year
- This repository measures the quality of YandexGPT, GigaChat, T-Pro, Saiga, Vikhr, and Ruadapt on popular English-language benchmarks: MGSM, MATH, HumanE… ☆23 · Apr 16, 2025 · Updated 10 months ago
- Open Source Text Embedding Models with OpenAI Compatible API ☆167 · Jul 13, 2024 · Updated last year
- Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training ☆222 · Aug 19, 2024 · Updated last year
- Easy and Efficient Quantization for Transformers ☆205 · Jan 28, 2026 · Updated 2 weeks ago
- SGLang is a high-performance serving framework for large language models and multimodal models. ☆23,439 · Feb 9, 2026 · Updated last week
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆40 · Feb 29, 2024 · Updated last year