vllm-project / guidellm
Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs
☆655 · Updated this week
Alternatives and similar repositories for guidellm
Users who are interested in guidellm are comparing it to the libraries listed below.
- LLMPerf is a library for validating and benchmarking LLMs ☆1,032 · Updated 10 months ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆2,106 · Updated this week
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆283 · Updated this week
- A throughput-oriented high-performance serving framework for LLMs ☆904 · Updated last month
- A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM ☆60 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆266 · Updated last year
- Comparison of Language Model Inference Engines ☆231 · Updated 10 months ago
- OpenAI-compatible API for the TensorRT-LLM Triton backend ☆215 · Updated last year
- Inference server benchmarking tool ☆118 · Updated 3 weeks ago
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆318 · Updated 3 weeks ago
- Advanced quantization algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA, and HPU ☆668 · Updated this week
- An innovative library for efficient LLM inference via low-bit quantization ☆349 · Updated last year
- Serverless LLM Serving for Everyone ☆561 · Updated this week
- Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond ☆104 · Updated this week
- Fast, Flexible and Portable Structured Generation ☆1,309 · Updated last week
- Achieve state-of-the-art inference performance with modern accelerators on Kubernetes ☆1,907 · Updated this week
- LLM model quantization (compression) toolkit with hardware acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi… ☆842 · Updated this week
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] Speeds up long-context LLM inference via approximate and dynamic sparse calculation of the attention… ☆1,141 · Updated 3 weeks ago
- Common recipes to run vLLM ☆172 · Updated this week
- vLLM’s reference system for K8s-native cluster-wide deployment with community-driven performance optimization ☆1,864 · Updated last week
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆659 · Updated 5 months ago
- Materials for learning SGLang ☆615 · Updated 3 weeks ago
- OME is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs) ☆292 · Updated this week
- The Triton TensorRT-LLM Backend ☆901 · Updated last week
- A collection of all available inference solutions for LLMs ☆91 · Updated 7 months ago