neuralmagic / guidellmLinks

Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs

☆348

Alternatives and similar repositories for guidellm

Users that are interested in guidellm are comparing it to the libraries listed below

Sorting:

neuralmagic / nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆264Updated 8 months ago
vllm-project / llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
☆1,518Updated this week
triton-inference-server / vllm_backend
☆267Updated last week
efeslab / Nanoflow
A throughput-oriented high-performance serving framework for LLMs
☆825Updated 2 weeks ago
intel / auto-round
Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. Seamlessly integrated with Torchao, Tra…
☆499Updated this week
lapp0 / lm-inference-engines
Comparison of Language Model Inference Engines
☆217Updated 6 months ago
ray-project / llmperf
LLMPerf is a library for validating and benchmarking LLMs
☆940Updated 6 months ago
huggingface / optimum-benchmark
🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O…
☆304Updated 3 weeks ago
ServerlessLLM / ServerlessLLM
Serverless LLM Serving for Everyone.
☆488Updated this week
facebookresearch / LayerSkip
Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024
☆311Updated last month
huggingface / inference-benchmarker
Inference server benchmarking tool
☆73Updated last month
microsoft / MInference
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention…
☆1,055Updated this week
ModelCloud / GPTQModel
Production ready LLM model compression/quantization toolkit with hw accelerated inference support for both cpu/gpu via HF, vLLM, and SGLa…
☆625Updated this week
fw-ai / benchmark
Benchmark suite for LLMs from Fireworks.ai
☆76Updated 2 weeks ago
run-ai / runai-model-streamer
☆221Updated this week
run-ai / llmperf
☆55Updated 9 months ago
npuichigo / openai_trtllm
OpenAI compatible API for TensorRT LLM triton backend
☆209Updated 10 months ago
huggingface / yourbench
🤗 Benchmark Large Language Models Reliably On Your Data
☆329Updated this week
snowflakedb / ArcticInference
☆155Updated this week
mobiusml / hqq
Official implementation of Half-Quadratic Quantization (HQQ)
☆832Updated this week
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆845Updated 9 months ago
neuralmagic / AutoFP8
☆194Updated last month
intel / neural-speed
An innovative library for efficient LLM inference via low-bit quantization
☆349Updated 9 months ago
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆377Updated 2 weeks ago
arcee-ai / EvolKit
EvolKit is an innovative framework designed to automatically enhance the complexity of instructions used for fine-tuning Large Language M…
☆223Updated 7 months ago
NVIDIA / kvpress
LLM KV cache compression made easy
☆508Updated this week
bentoml / BentoVLLM
Self-host LLMs with vLLM and BentoML
☆120Updated this week
microsoft / VPTQ
VPTQ, A Flexible and Extreme low-bit quantization algorithm
☆643Updated last month
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆705Updated 3 months ago
apoorvumang / prompt-lookup-decoding
☆541Updated 9 months ago