vllm-project / guidellm
Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs
☆799 · Updated this week
Alternatives and similar repositories for guidellm
Users interested in guidellm are comparing it to the libraries listed below.
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆368 · Updated last week
- LLMPerf is a library for validating and benchmarking LLMs ☆1,075 · Updated last year
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆2,553 · Updated this week
- A throughput-oriented high-performance serving framework for LLMs ☆936 · Updated 2 months ago
- A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM ☆190 · Updated this week
- vLLM's reference system for K8S-native cluster-wide deployment with community-driven performance optimization ☆2,098 · Updated last week
- Inference server benchmarking tool ☆136 · Updated 3 months ago
- Common recipes to run vLLM ☆327 · Updated this week
- Open Model Engine (OME): Kubernetes operator for LLM serving, GPU scheduling, and model lifecycle management. Works with SGLang, vLLM, T… ☆355 · Updated this week
- Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond ☆744 · Updated this week
- Comparison of Language Model Inference Engines ☆238 · Updated last year
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆325 · Updated 3 months ago
- 🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality degradation across Weight-Only Quantiza… ☆806 · Updated this week
- Checkpoint-engine is a simple middleware to update model weights in LLM inference engines ☆888 · Updated this week
- LLM model quantization (compression) toolkit with hardware acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi… ☆971 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆267 · Updated last month
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) ☆267 · Updated this week
- Materials for learning SGLang ☆714 · Updated last week
- Achieve state-of-the-art inference performance with modern accelerators on Kubernetes ☆2,340 · Updated this week
- An Open Source Toolkit for LLM Distillation ☆819 · Updated 3 weeks ago
- Fast, Flexible and Portable Structured Generation ☆1,464 · Updated this week
- OpenAI-compatible API for the TensorRT-LLM Triton backend ☆218 · Updated last year
- Serverless LLM Serving for Everyone ☆631 · Updated this week
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLMs' inference, approximate and dynamic sparse calculate the attention… ☆1,171 · Updated 3 months ago
- Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv… ☆251 · Updated this week
- The Triton TensorRT-LLM Backend ☆912 · Updated this week