A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM
☆391 · May 1, 2026 · Updated this week
Alternatives and similar repositories for speculators
Users interested in speculators are comparing it to the libraries listed below.
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆275 · Updated this week
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆3,190 · Updated this week
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. ☆816 · Apr 2, 2026 · Updated last month
- vLLM adapter for a TGIS-compatible gRPC server. ☆55 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆266 · Dec 4, 2025 · Updated 5 months ago
- 3x Faster Inference; Unofficial implementation of EAGLE Speculative Decoding ☆83 · Jul 3, 2025 · Updated 10 months ago
- A fast, local, and secure approach for training LLMs for coding tasks using GRPO with WebAssembly and interpreter feedback. ☆42 · Apr 4, 2025 · Updated last year
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆426 · Apr 23, 2026 · Updated 2 weeks ago
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). ☆2,313 · Feb 20, 2026 · Updated 2 months ago
- Bagua tutorials. ☆13 · Sep 4, 2022 · Updated 3 years ago
- The Soft Cosine Measure system developed for the ARQMath-3 shared task evaluation of math information retrieval systems ☆13 · Sep 8, 2022 · Updated 3 years ago
- ☆47 · Nov 10, 2023 · Updated 2 years ago
- Achieve state-of-the-art inference performance with modern accelerators on Kubernetes ☆3,107 · Updated this week
- FlashSampling: Fast and Memory-Efficient Exact Sampling (https://huggingface.co/papers/2603.15854) ☆70 · Apr 25, 2026 · Updated last week
- ☆12 · Mar 8, 2022 · Updated 4 years ago
- Memory-optimized Mixture of Experts ☆75 · Jul 25, 2025 · Updated 9 months ago
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆1,065 · Sep 4, 2024 · Updated last year
- Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv… ☆294 · Apr 23, 2026 · Updated last week
- Common recipes to run vLLM ☆772 · Updated this week
- KV Cache & LoRA for minGPT ☆62 · Mar 4, 2026 · Updated 2 months ago
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆155 · Aug 21, 2025 · Updated 8 months ago
- Distributed SDDMM Kernel ☆12 · Jul 8, 2022 · Updated 3 years ago
- A high-performance and lightweight router for vLLM large-scale deployment ☆214 · Apr 30, 2026 · Updated last week
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆338 · Jul 2, 2024 · Updated last year
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs ☆1,086 · Apr 29, 2026 · Updated last week
- A CUDA kernel optimization toolkit for validation, benchmarking, Nsight Compute profiling, bottleneck analysis, and iterative tuning. It … ☆146 · Apr 22, 2026 · Updated 2 weeks ago
- NVIDIA Inference Xfer Library (NIXL) ☆1,011 · Apr 30, 2026 · Updated last week
- Efficient LLM Inference over Long Sequences ☆394 · Jun 25, 2025 · Updated 10 months ago
- Benchmark and optimize LLM inference across frameworks with ease ☆183 · Sep 12, 2025 · Updated 7 months ago
- Longitudinal Evaluation of LLMs via Data Compression ☆33 · May 29, 2024 · Updated last year
- A course for learning CUDA from scratch ☆13 · Nov 3, 2024 · Updated last year
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆835 · Mar 6, 2025 · Updated last year
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆43 · Jan 15, 2024 · Updated 2 years ago
- Materials for learning SGLang ☆808 · Jan 5, 2026 · Updated 4 months ago
- Cloud Native Benchmarking of Foundation Models ☆45 · Jul 31, 2025 · Updated 9 months ago
- A throughput-oriented high-performance serving framework for LLMs ☆956 · Mar 29, 2026 · Updated last month
- FlashInfer: Kernel Library for LLM Serving ☆5,544 · Updated this week
- ☆53 · Feb 19, 2024 · Updated 2 years ago
- Tile-based language built for AI computation across all scales ☆143 · Mar 27, 2026 · Updated last month