Train speculative decoding models effortlessly and port them smoothly to SGLang serving.
☆730 · Mar 14, 2026 · Updated last week
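SpecForge trains draft models for speculative decoding, the technique most of the repositories below revolve around. As background, the core draft-and-verify loop can be sketched as follows. This is a minimal greedy-acceptance toy, not SpecForge's API: `draft_model`, `target_model`, and their arithmetic are hypothetical stand-ins for a small draft LLM and a large target LLM.

```python
# Toy sketch of greedy-acceptance speculative decoding. The two "models"
# below are hypothetical arithmetic stand-ins; they exist only to make
# the control flow runnable, not to model any real LLM.
def draft_model(ctx):
    # Cheap proposer: often, but not always, agrees with the target.
    return (sum(ctx) * 7 + 3) % 10

def target_model(ctx):
    # Expensive verifier: diverges from the draft whenever sum(ctx) % 3 == 0.
    s = sum(ctx)
    return (s + 1) % 10 if s % 3 == 0 else (s * 7 + 3) % 10

def speculative_step(ctx, k=4):
    """Draft k tokens ahead, keep the longest prefix the target agrees
    with, then append one token the target generates itself. Accepted
    tokens cost one target pass for the whole batch instead of one each."""
    drafts, c = [], list(ctx)
    for _ in range(k):                 # 1) cheap autoregressive drafting
        t = draft_model(tuple(c))
        drafts.append(t)
        c.append(t)
    accepted, c = [], list(ctx)
    for t in drafts:                   # 2) target verifies draft tokens
        if target_model(tuple(c)) != t:
            break                      # first mismatch: discard the rest
        accepted.append(t)
        c.append(t)
    accepted.append(target_model(tuple(c)))  # 3) target's own next token
    return accepted

print(speculative_step((1, 2, 4)))  # prints [2, 0]
```

Production systems (EAGLE, Medusa, vLLM's speculative mode) use probabilistic acceptance rather than this exact-match rule, which preserves the target model's output distribution, but the draft/verify/correct structure is the same.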
Alternatives and similar repositories for SpecForge
Users that are interested in SpecForge are comparing it to the libraries listed below
- A Rust reimplementation of genai-bench for benchmarking LLM serving systems at high concurrency with accurate timing and industry-standar… (☆279 · updated this week)
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25). (☆2,229 · Feb 20, 2026 · updated last month)
- FlashInfer: Kernel Library for LLM Serving (☆5,145 · Mar 15, 2026 · updated last week)
- Materials for learning SGLang (☆775 · Jan 5, 2026 · updated 2 months ago)
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, approximate and dynamic sparse calculation of attention… (☆1,198 · Mar 9, 2026 · updated last week)
- NVIDIA Inference Xfer Library (NIXL) (☆945 · updated this week)
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) (☆372 · Apr 22, 2025 · updated 11 months ago)
- Perplexity GPU Kernels (☆564 · Nov 7, 2025 · updated 4 months ago)
- SGLang is a high-performance serving framework for large language models and multimodal models. (☆24,829 · updated this week)
- ☆65 · Apr 26, 2025 · updated 10 months ago
- My learning notes for ML SYS. (☆5,737 · updated this week)
- slime is an LLM post-training framework for RL scaling. (☆4,799 · updated this week)
- Distributed compiler based on Triton for parallel systems (☆1,386 · Mar 11, 2026 · updated last week)
- DFlash: Block Diffusion for Flash Speculative Decoding (☆634 · Mar 15, 2026 · updated last week)
- Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. (☆4,953 · updated this week)
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM (☆2,891 · updated this week)
- LightLLM is a Python-based LLM inference and serving framework, notable for its lightweight design, easy scalabili… (☆3,958 · updated this week)
- Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels (☆5,403 · updated this week)
- A unified library of SOTA model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresse… (☆2,156 · Mar 15, 2026 · updated last week)
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… (☆818 · Mar 6, 2025 · updated last year)
- A datacenter-scale distributed inference serving framework (☆6,347 · updated this week)
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding (☆277 · Aug 31, 2024 · updated last year)
- Open Model Engine (OME): Kubernetes operator for LLM serving, GPU scheduling, and model lifecycle management. Works with SGLang, vLLM, T… (☆397 · updated this week)
- ☆207 · May 5, 2025 · updated 10 months ago
- A fast communication-overlapping library for tensor/expert parallelism on GPUs. (☆1,273 · Aug 28, 2025 · updated 6 months ago)
- ☆19 · Dec 24, 2024 · updated last year
- LLM KV cache compression made easy (☆971 · Mar 13, 2026 · updated last week)
- Bridge Megatron-Core to Hugging Face / reinforcement learning (☆201 · Mar 13, 2026 · updated last week)
- 📰 Must-read papers and blogs on speculative decoding ⚡️ (☆1,145 · Mar 9, 2026 · updated last week)
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads (☆2,719 · Jun 25, 2024 · updated last year)
- A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM (☆285 · updated this week)
- A Quirky Assortment of CuTe Kernels (☆861 · updated this week)
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads (☆531 · Feb 10, 2025 · updated last year)
- NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer (☆170 · Feb 11, 2026 · updated last month)
- Checkpoint-engine is a simple middleware for updating model weights in LLM inference engines (☆925 · Feb 28, 2026 · updated 3 weeks ago)
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length (☆148 · Dec 23, 2025 · updated 2 months ago)
- VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo (☆1,745 · updated this week)
- Best practices for Megatron on veRL, with a tuning guide (☆132 · Sep 26, 2025 · updated 5 months ago)
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference (☆377 · Jul 10, 2025 · updated 8 months ago)