CacheGen (☆150, updated Oct 9, 2024)
Alternatives and similar repositories for CacheGen
Users interested in CacheGen are comparing it to the repositories listed below.
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI'24) (☆174, updated Jul 10, 2024)
- Artifact for "Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving" [SOSP '24] (☆24, updated Nov 21, 2024)
- Disaggregated serving system for Large Language Models (LLMs) (☆777, updated Apr 6, 2025)
- A low-latency & high-throughput serving engine for LLMs (☆480, updated Jan 8, 2026)
- Efficient Long-context Language Model Training by Core Attention Disaggregation (☆91, updated this week)
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable (☆210, updated Sep 21, 2024)
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding (☆144, updated Dec 4, 2024)
- Stateful LLM Serving (☆96, updated Mar 11, 2025)
- A large-scale simulation framework for LLM inference (☆539, updated Jul 25, 2025)
- Official repo for "SplitQuant / LLM-PQ: Resource-Efficient LLM Offline Serving on Heterogeneous GPUs via Phase-Aware Model Partition and …" (☆36, updated Aug 29, 2025)
- Efficient and easy multi-instance LLM serving (☆527, updated Sep 3, 2025)
- [ICML 2024] "LoCoCo: Dropping In Convolutions for Long Context Compression", Ruisi Cai, Yuandong Tian, Zhangyang Wang, Beidi Chen (☆17, updated Sep 7, 2024)
- Supercharge Your LLM with the Fastest KV Cache Layer (☆6,923, updated this week)
- The driver for LMCache core to run in vLLM (☆61, updated Feb 4, 2025)
- High-performance Transformer implementation in C++ (☆152, updated Jan 18, 2025)
- KV cache compression for high-throughput LLM inference (☆154, updated Feb 5, 2025)
- SpotServe: Serving Generative Large Language Models on Preemptible Instances (☆135, updated Feb 22, 2024)
- QAQ: Quality Adaptive Quantization for LLM KV Cache (☆55, updated Mar 27, 2024)
- PyTorch library for cost-effective, fast, and easy serving of MoE models (☆284, updated this week)
- Systematic and comprehensive benchmarks for LLM systems (☆51, updated Jan 28, 2026)
- The official implementation of the paper "SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction" (☆51, updated Oct 18, 2024)
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache (☆358, updated Nov 20, 2025)
- LLM serving cluster simulator (☆135, updated Apr 25, 2024)
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗) (☆659, updated Sep 30, 2025)
- Hairpin: Rethinking Packet Loss Recovery in Edge-based Interactive Video Streaming (NSDI 2024) (☆25, updated Mar 5, 2025)
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank (☆71, updated Nov 4, 2024)
- The official implementation of Ada-KV [NeurIPS 2025] (☆128, updated Nov 26, 2025)
- A tiny yet powerful LLM inference system tailored for research purposes; vLLM-equivalent performance with only 2k lines of code (2% of …) (☆314, updated Jun 10, 2025)
- Accelerating GPU Data Processing using FastLanes Compression (☆16, updated May 9, 2024)