project-etalon / etalonLinks

LLM Serving Performance Evaluation Harness

☆81

Alternatives and similar repositories for etalon

Users that are interested in etalon are comparing it to the libraries listed below

Sorting:

tyler-griggs / melange-release
☆48Updated last year
WukLab / preble
Stateful LLM Serving
☆89Updated 8 months ago
Hsword / SpotServe
SpotServe: Serving Generative Large Language Models on Preemptible Instances
☆132Updated last year
yale-sys / prompt-cache
Modular and structured prompt caching for low-latency LLM inference
☆103Updated last year
microsoft / ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
☆199Updated last year
microsoft / chunk-attention
☆82Updated 7 months ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆132Updated last year
hao-ai-lab / MuxServe
☆79Updated last month
microsoft / sarathi-serve
A low-latency & high-throughput serving engine for LLMs
☆450Updated last month
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆243Updated last year
EfficientMoE / MoE-Infinity
PyTorch library for cost-effective, fast and easy serving of MoE models.
☆262Updated last month
hao-ai-lab / vllm-ltr
[NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank
☆65Updated last year
zhengzangw / Sequence-Scheduling
PyTorch implementation of paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline".
☆94Updated 2 years ago
LoongServe / LoongServe
☆124Updated last year
UChi-JCL / CacheGen
☆141Updated last year
sgl-project / genai-bench
Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv…
☆234Updated last week
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆224Updated 2 years ago
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆170Updated last year
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆331Updated last year
anyscale / llm-continuous-batching-benchmarks
☆122Updated last year
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆86Updated last year
snowflakedb / ArcticInference
ArcticInference: vLLM plugin for high-throughput, low-latency inference
☆327Updated this week
vllm-project / speculators
A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM
☆140Updated this week
MDK8888 / vllmini
A minimal implementation of vllm.
☆62Updated last year
argonne-lcf / LLM-Inference-Bench
LLM-Inference-Bench
☆57Updated 4 months ago
SymbioticLab / Oobleck
A resilient distributed training framework
☆96Updated last year
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆145Updated 10 months ago
InternLM / turbomind
☆97Updated 8 months ago
microsoft / vidur
A large-scale simulation framework for LLM inference
☆488Updated 4 months ago
flexflow / flexflow-serve
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
☆63Updated 2 months ago