vllm-project / guidellm
Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs
☆655 · Updated this week
Alternatives and similar repositories for guidellm
Users who are interested in guidellm are comparing it to the libraries listed below.
- LLMPerf is a library for validating and benchmarking LLMs ☆1,032 · Updated 10 months ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆2,106 · Updated this week
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆283 · Updated this week
- A throughput-oriented high-performance serving framework for LLMs ☆904 · Updated last month
- A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM ☆60 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆266 · Updated last year
- Comparison of Language Model Inference Engines ☆231 · Updated 10 months ago
- OpenAI-compatible API for the TensorRT-LLM Triton backend ☆215 · Updated last year
- Inference server benchmarking tool ☆118 · Updated 3 weeks ago
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆318 · Updated 3 weeks ago
- Advanced quantization algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA, and HPU ☆668 · Updated this week
- An innovative library for efficient LLM inference via low-bit quantization ☆349 · Updated last year
- Serverless LLM Serving for Everyone ☆561 · Updated this week
- Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond ☆104 · Updated this week
- Fast, Flexible and Portable Structured Generation ☆1,309 · Updated last week
- Achieve state-of-the-art inference performance with modern accelerators on Kubernetes ☆1,907 · Updated this week
- LLM model quantization (compression) toolkit with hardware acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi… ☆842 · Updated this week
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] Speeds up long-context LLM inference via approximate and dynamic sparse calculation of the attention… ☆1,141 · Updated 3 weeks ago
- Common recipes to run vLLM ☆172 · Updated this week
- vLLM’s reference system for K8s-native cluster-wide deployment with community-driven performance optimization ☆1,864 · Updated last week
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆659 · Updated 5 months ago
- Materials for learning SGLang ☆615 · Updated 3 weeks ago
- OME is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs) ☆292 · Updated this week
- The Triton TensorRT-LLM Backend ☆901 · Updated last week
- A collection of all available inference solutions for LLMs ☆91 · Updated 7 months ago