LMCache / demo
☆20 · Updated 3 months ago
Alternatives and similar repositories for demo
Users interested in demo are comparing it to the libraries listed below.
- The driver for LMCache core to run in vLLM · ☆45 · Updated 6 months ago
- Modular and structured prompt caching for low-latency LLM inference · ☆98 · Updated 9 months ago
- ☆47 · Updated last year
- LLM Serving Performance Evaluation Harness · ☆79 · Updated 5 months ago
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding · ☆123 · Updated 8 months ago
- ☆79 · Updated 8 months ago
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable · ☆172 · Updated 10 months ago
- ☆16 · Updated 2 months ago
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation · ☆30 · Updated 8 months ago
- ☆67 · Updated last year
- KV cache compression for high-throughput LLM inference · ☆134 · Updated 6 months ago
- ☆116 · Updated 10 months ago
- Stateful LLM Serving · ☆79 · Updated 4 months ago
- ☆40 · Updated 3 months ago
- ☆92 · Updated 4 months ago
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank · ☆55 · Updated 9 months ago
- ☆125 · Updated 3 weeks ago
- Simple extension on vLLM to help you speed up reasoning models without training · ☆172 · Updated 2 months ago
- PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design (KDD 2025) · ☆22 · Updated last year
- ☆78 · Updated 3 months ago
- ArcticInference: vLLM plugin for high-throughput, low-latency inference · ☆203 · Updated this week
- Self-host LLMs with LMDeploy and BentoML · ☆22 · Updated last month
- A simple calculation for LLM MFU (see the sketch after this list) · ☆42 · Updated 5 months ago
- ☆29 · Updated 5 months ago
- Scalable long-context LLM decoding that leverages sparsity by treating the KV cache as a vector storage system · ☆71 · Updated this week
- Pretrain, finetune and serve LLMs on Intel platforms with Ray · ☆128 · Updated last month
- Official Implementation of SAM-Decoding: Speculative Decoding via Suffix Automaton · ☆30 · Updated 5 months ago
- Hydragen: High-Throughput LLM Inference with Shared Prefixes · ☆41 · Updated last year
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… · ☆169 · Updated last week
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang · ☆55 · Updated 9 months ago
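The MFU entry in the list above boils down to a single ratio: achieved FLOPs per second divided by the accelerator's peak FLOPs per second. The sketch below is a minimal, hypothetical illustration of that standard definition (it is not code from the listed repository); it assumes roughly 2 FLOPs per parameter per generated token and ignores attention FLOPs, and the model size, throughput, and peak TFLOPs in the example are made-up numbers.

```python
# Hedged sketch of the usual MFU (Model FLOPs Utilization) formula for decoding:
#   MFU = (tokens/s * FLOPs per token) / peak FLOPs/s
# All concrete numbers below are illustrative assumptions, not measurements.

def inference_mfu(params_billion: float,
                  tokens_per_second: float,
                  peak_tflops: float) -> float:
    """Approximate decode-time MFU using ~2 FLOPs per parameter per generated token."""
    flops_per_token = 2 * params_billion * 1e9   # matmul FLOPs only; attention ignored
    achieved_flops_per_s = flops_per_token * tokens_per_second
    return achieved_flops_per_s / (peak_tflops * 1e12)

if __name__ == "__main__":
    # Example: a 7B-parameter model decoding 2,000 tok/s on a GPU with 989 peak TFLOPs.
    print(f"MFU ≈ {inference_mfu(7, 2000, 989):.2%}")
```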