triton-inference-server / vllm_backend

☆280

Alternatives and similar repositories for vllm_backend

Users who are interested in vllm_backend are comparing it to the libraries listed below.

npuichigo / openai_trtllm
OpenAI compatible API for TensorRT LLM triton backend
☆209 Updated last year
NetEase-FuXi / EETQ
Easy and Efficient Quantization for Transformers
☆198 Updated last month
triton-inference-server / tensorrtllm_backend
The Triton TensorRT-LLM Backend
☆872 Updated this week
triton-inference-server / perf_analyzer
☆98 Updated this week
neuralmagic / nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆265 Updated 9 months ago
neuralmagic / AutoFP8
☆195 Updated 2 months ago
huggingface / optimum-benchmark
🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O…
☆307 Updated 2 months ago
triton-inference-server / backend
Common source, scripts and utilities for creating Triton backends.
☆336 Updated last week
triton-inference-server / onnxruntime_backend
The Triton backend for the ONNX Runtime.
☆156 Updated last week
anyscale / llm-continuous-batching-benchmarks
☆120 Updated last year
vllm-project / guidellm
Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs
☆438 Updated this week
lapp0 / lm-inference-engines
Comparison of Language Model Inference Engines
☆222 Updated 7 months ago
efeslab / Nanoflow
A throughput-oriented high-performance serving framework for LLMs
☆856 Updated 3 weeks ago
microsoft / batch-inference
Dynamic batching library for Deep Learning inference. Tutorials for LLM, GPT scenarios.
☆101 Updated 11 months ago
triton-inference-server / tensorrt_backend
The Triton backend for TensorRT.
☆77 Updated last week
intel / llm-on-ray
Pretrain, finetune and serve LLMs on Intel platforms with Ray
☆128 Updated 3 weeks ago
triton-inference-server / model_analyzer
Triton Model Analyzer is a CLI tool to help with better understanding of the compute and memory requirements of the Triton Inference Serv…
☆482 Updated last week
bentoml / llm-bench
☆55 Updated 8 months ago
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆868 Updated 10 months ago
triton-inference-server / tutorials
This repository contains tutorials and examples for Triton Inference Server
☆742 Updated last week
vectorch-ai / ScaleLLM
A high-performance inference system for large language models, designed for production environments.
☆460 Updated last week
snowflakedb / ArcticInference
ArcticInference: vLLM plugin for high-throughput, low-latency inference
☆198 Updated this week
triton-inference-server / model_navigator
Triton Model Navigator is an inference toolkit designed for optimizing and deploying Deep Learning models with a focus on NVIDIA GPUs.
☆210 Updated 3 months ago
triton-inference-server / triton_cli
Triton CLI is an open source command line interface that enables users to create, deploy, and profile models served by the Triton Inferen…
☆64 Updated last week
ray-project / llmperf
LLMPerf is a library for validating and benchmarking LLMs
☆970 Updated 7 months ago
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆385 Updated last month
wejoncy / QLLM
A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily.
☆175 Updated 4 months ago
run-ai / llmperf
☆58 Updated 10 months ago
HabanaAI / vllm-fork
A high-throughput and memory-efficient inference and serving engine for LLMs
☆78 Updated this week
inferflow / inferflow
Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).
☆244 Updated last year