LMCache / demo
☆27 · Updated 9 months ago
Alternatives and similar repositories for demo
Users interested in demo are comparing it to the libraries listed below.
- Modular and structured prompt caching for low-latency LLM inference ☆110 · Updated last year
- ☆48 · Updated last year
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆207 · Updated last year
- ☆19 · Updated 7 months ago
- Scalable long-context LLM decoding that leverages sparsity by treating the KV cache as a vector storage system ☆113 · Updated 2 weeks ago
- ☆81 · Updated 3 months ago
- [ICLR 2025] Breaking the Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆136 · Updated last year
- LLM Serving Performance Evaluation Harness ☆82 · Updated 10 months ago
- Stateful LLM Serving ☆94 · Updated 10 months ago
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation ☆31 · Updated last year
- The driver for the LMCache core to run in vLLM ☆58 · Updated 11 months ago
- Efficient Long-context Language Model Training by Core Attention Disaggregation ☆79 · Updated 2 weeks ago
- ☆73 · Updated 4 months ago
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank ☆67 · Updated last year
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆68 · Updated 4 months ago
- SpotServe: Serving Generative Large Language Models on Preemptible Instances ☆134 · Updated last year
- ☆92 · Updated last year
- Efficient Compute-Communication Overlap for Distributed LLM Inference ☆68 · Updated 2 months ago
- Easy, Fast, and Scalable Multimodal AI ☆92 · Updated this week
- ☆162 · Updated 6 months ago
- DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit ☆91 · Updated this week
- [Archived] For the latest updates and community contributions, please visit: https://github.com/Ascend/TransferQueue or https://gitcode.co… ☆13 · Updated this week
- DS SERVE: The Largest Open Vector Store over Pretrain Data; A Framework for Efficient and Scalable Neural Retrieval ☆37 · Updated last month
- An early research-stage expert-parallel load balancer for MoE models based on linear programming ☆485 · Updated last month
- ☆74 · Updated last year
- An NCCL extension library designed to efficiently offload GPU memory allocated by the NCCL communication library ☆80 · Updated last month
- ☆147 · Updated last year
- Bamboo-7B Large Language Model ☆93 · Updated last year
- APEX+ is an LLM Serving Simulator ☆41 · Updated 7 months ago
- ☆96 · Updated 9 months ago