Efficient and easy multi-instance LLM serving
☆549 Mar 12, 2026 Updated 2 months ago
Alternatives and similar repositories for llumnix-ray
Users that are interested in llumnix-ray are comparing it to the libraries listed below.
- Disaggregated serving system for Large Language Models (LLMs). ☆808 Apr 6, 2025 Updated last year
- A low-latency & high-throughput serving engine for LLMs ☆497 Jan 8, 2026 Updated 4 months ago
- A throughput-oriented high-performance serving framework for LLMs ☆959 Mar 29, 2026 Updated last month
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆483 May 30, 2025 Updated 11 months ago
- KV cache store for distributed LLM inference ☆417 Nov 13, 2025 Updated 6 months ago
- Artifact of OSDI '24 paper, “Llumnix: Dynamic Scheduling for Large Language Model Serving” ☆64 Jun 5, 2024 Updated last year
- ☆133 Nov 11, 2024 Updated last year
- Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. ☆5,339 Updated this week
- Accurate, large-scale, and extensible simulator for LLM inference Systems ☆601 Jul 25, 2025 Updated 9 months ago
- SpotServe: Serving Generative Large Language Models on Preemptible Instances ☆134 Feb 22, 2024 Updated 2 years ago
- FlashInfer: Kernel Library for LLM Serving ☆5,621 Updated this week
- RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications. ☆1,115 Updated this week
- A fast communication-overlapping library for tensor/expert parallelism on GPUs. ☆1,304 Aug 28, 2025 Updated 8 months ago
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆217 Sep 21, 2024 Updated last year
- GLake: optimizing GPU memory management and IO transmission.