LMCache / lmcache-vllm
The driver for LMCache core to run in vLLM
☆38 · Updated 3 months ago
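Since this listing gives only a one-line description, here is a minimal sketch of how a drop-in driver like this is typically used: vLLM's Python entrypoints are re-exported under the driver's namespace so existing scripts pick up LMCache's KV-cache reuse with no other changes. The `lmcache_vllm.vllm` import path and the re-exported `LLM`/`SamplingParams` names are assumptions based on that drop-in pattern, not a verified API; check the repository's README for the exact interface.

```python
# Hypothetical sketch of the drop-in pattern: swap vLLM imports for the
# lmcache_vllm wrapper so generations can reuse KV cache through LMCache.
# The module path and re-exported names are assumptions, not verified API.
from lmcache_vllm.vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", gpu_memory_utilization=0.8)
params = SamplingParams(temperature=0.0, max_tokens=64)

# Long shared prefixes (documents, chat history) are where cached KV reuse
# pays off: the second request should skip most of the prefill work.
context = open("long_document.txt").read()
print(llm.generate([context + "\n\nSummarize the document."], params)[0].outputs[0].text)
print(llm.generate([context + "\n\nList three key findings."], params)[0].outputs[0].text)
```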
Alternatives and similar repositories for lmcache-vllm:
Users interested in lmcache-vllm are comparing it to the libraries listed below.
- LLM Serving Performance Evaluation Harness ☆77 · Updated 2 months ago
- Stateful LLM Serving ☆65 · Updated last month
- ☆45 · Updated 10 months ago
- ☆11 · Updated 2 weeks ago
- ☆104 · Updated 4 months ago
- KV cache store for distributed LLM inference ☆165 · Updated this week
- ☆100 · Updated 6 months ago
- ☆84 · Updated last month
- Modular and structured prompt caching for low-latency LLM inference ☆92 · Updated 5 months ago
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆155 · Updated 7 months ago
- ☆59 · Updated 10 months ago
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆115 · Updated 5 months ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks ☆99 · Updated last year
- SpotServe: Serving Generative Large Language Models on Preemptible Instances ☆118 · Updated last year
- A low-latency & high-throughput serving engine for LLMs ☆351 · Updated 2 weeks ago
- ☆73 · Updated 2 weeks ago
- ☆53 · Updated 7 months ago
- ☆95 · Updated 5 months ago
- ☆25 · Updated 2 weeks ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆73 · Updated 8 months ago
- Fast and memory-efficient exact attention ☆68 · Updated last week
- A lightweight design for computation-communication overlap ☆35 · Updated last week
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆34 · Updated this week
- ☆11 · Updated this week
- High-performance Transformer implementation in C++ ☆119 · Updated 3 months ago
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank ☆47 · Updated 6 months ago
- ☆50 · Updated 5 months ago
- ☆34 · Updated 4 months ago
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation ☆64 · Updated this week
- KV cache compression for high-throughput LLM inference ☆124 · Updated 3 months ago