vllm-project / speculatorsLinks

A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM

☆140

Alternatives and similar repositories for speculators

Users that are interested in speculators are comparing it to the libraries listed below

Sorting:

snowflakedb / ArcticInference
ArcticInference: vLLM plugin for high-throughput, low-latency inference
☆327Updated this week
project-etalon / etalon
LLM Serving Performance Evaluation Harness
☆81Updated 9 months ago
neuralmagic / AutoFP8
☆205Updated 7 months ago
neuralmagic / nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆267Updated last year
vllm-project / dashboard
vLLM performance dashboard
☆38Updated last year
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆145Updated 10 months ago
snowflakedb / ArcticTraining
ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs)
☆257Updated this week
vllm-project / recipes
Common recipes to run vLLM
☆256Updated this week
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆132Updated last year
sgl-project / genai-bench
Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv…
☆234Updated last week
fw-ai / benchmark
Benchmark suite for LLMs from Fireworks.ai
☆84Updated last week
vllm-project / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆214Updated this week
bentoml / llm-bench
☆56Updated last year
NetEase-FuXi / EETQ
Easy and Efficient Quantization for Transformers
☆203Updated 5 months ago
anyscale / llm-continuous-batching-benchmarks
☆122Updated last year
hao-ai-lab / Dynasor
[NeurIPS 2025] Simple extension on vLLM to help you speed up reasoning model without training.
☆209Updated 6 months ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆86Updated last year
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆243Updated last year
vllm-project / tpu-inference
TPU inference for vLLM, with unified JAX and PyTorch support.
☆170Updated last week
InternLM / turbomind
☆97Updated 8 months ago
cornserve-ai / cornserve
Easy, Fast, and Scalable Multimodal AI
☆76Updated last week
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆392Updated 5 months ago
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆170Updated last year
SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆118Updated last year
deepspeedai / DeepSpeed-Kernels
☆71Updated 8 months ago
MDK8888 / vllmini
A minimal implementation of vllm.
☆62Updated last year
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆391Updated last year
yale-sys / prompt-cache
Modular and structured prompt caching for low-latency LLM inference
☆103Updated last year
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆217Updated last week
LMCache / lmcache-vllm
The driver for LMCache core to run in vLLM
☆58Updated 10 months ago