vllm-project / speculators
A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM
☆190 · Updated this week
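For context on the technique this library targets: speculative decoding has a cheap draft model propose several tokens that the larger target model then verifies, accepting or rejecting each. Below is a minimal, self-contained sketch of the standard draft-then-verify accept/reject loop under toy assumptions; `toy_model`, `speculative_step`, and the models-as-probability-dictionaries interface are illustrative placeholders, not the speculators API.

```python
import random

VOCAB = list(range(8))  # toy vocabulary

def toy_model(tokens):
    """Hypothetical stand-in for a model: uniform next-token distribution."""
    p = 1.0 / len(VOCAB)
    return {t: p for t in VOCAB}

def speculative_step(prefix, draft_model, target_model, k=4):
    """One draft-then-verify round of standard speculative decoding."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    ctx, drafts = list(prefix), []
    for _ in range(k):
        q = draft_model(ctx)
        tok = random.choices(list(q), weights=list(q.values()))[0]
        drafts.append((tok, q))
        ctx.append(tok)
    # Verify phase: accept each draft token with probability min(1, p/q);
    # on the first rejection, resample from the residual max(0, p - q).
    out = list(prefix)
    for tok, q in drafts:
        p = target_model(out)
        if random.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        residual = {t: max(0.0, p[t] - q[t]) for t in VOCAB}
        z = sum(residual.values())
        pool = list(residual) if z > 0 else list(p)
        weights = list(residual.values()) if z > 0 else list(p.values())
        out.append(random.choices(pool, weights=weights)[0])
        break  # stop after the first rejection
    return out

print(speculative_step([0, 1], toy_model, toy_model, k=4))
```

In production engines such as vLLM, the verify phase scores all k draft positions in a single batched forward pass of the target model, which is where the latency win comes from.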
Alternatives and similar repositories for speculators
Users interested in speculators are comparing it to the libraries listed below:
- ArcticInference: vLLM plugin for high-throughput, low-latency inference · ☆368 · Updated last week
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) · ☆269 · Updated this week
- KV cache compression for high-throughput LLM inference · ☆148 · Updated 11 months ago
- vLLM performance dashboard · ☆39 · Updated last year
- A high-throughput and memory-efficient inference and serving engine for LLMs · ☆267 · Updated last month
- ☆206 · Updated 8 months ago
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding · ☆136 · Updated last year
- A safetensors extension to efficiently store sparse quantized tensors on disk · ☆233 · Updated this week
- Boosting 4-bit inference kernels with 2:4 Sparsity · ☆90 · Updated last year
- LLM Serving Performance Evaluation Harness · ☆82 · Updated 10 months ago
- Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv… · ☆251 · Updated last week
- Benchmark suite for LLMs from Fireworks.ai · ☆84 · Updated last month
- Efficient LLM Inference over Long Sequences · ☆393 · Updated 6 months ago
- PyTorch Distributed-native training library for LLMs/VLMs with out-of-the-box Hugging Face support · ☆245 · Updated this week
- Easy and Efficient Quantization for Transformers · ☆202 · Updated 6 months ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference · ☆118 · Updated last year
- ☆123 · Updated last year
- Code for data-aware compression of DeepSeek models · ☆68 · Updated last month
- Common recipes to run vLLM · ☆335 · Updated this week
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components · ☆218 · Updated this week
- ☆96 · Updated 9 months ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving · ☆333 · Updated last year
- A minimal implementation of vLLM · ☆65 · Updated last year
- ☆219 · Updated 11 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM · ☆175 · Updated last year
- Applied AI experiments and examples for PyTorch · ☆312 · Updated 4 months ago
- An early-research-stage expert-parallel load balancer for MoE models based on linear programming · ☆485 · Updated last month
- LM engine is a library for pretraining/finetuning LLMs · ☆108 · Updated this week
- Easy, Fast, and Scalable Multimodal AI · ☆89 · Updated last week
- TPU inference for vLLM, with unified JAX and PyTorch support · ☆213 · Updated this week