ModelEngine-Group / unified-cache-management
Persist and reuse KV Cache to speed up your LLM.
☆132 · Updated this week
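The core idea named in the description — persisting KV cache entries so later requests that share a prompt prefix can skip recomputation — can be sketched as follows. This is an illustrative toy, not this repository's actual API; the class and function names (`PrefixKVCache`, `prefill`, `compute_kv`) are invented for the example, and real systems store GPU tensors per attention layer rather than strings.

```python
# Hypothetical sketch of prefix-based KV cache reuse (not the
# unified-cache-management API): cache per-token "KV" entries keyed by
# the token prefix, so a request sharing a prefix skips recomputation.
from typing import Dict, List, Tuple


class PrefixKVCache:
    def __init__(self) -> None:
        # Maps a token-prefix tuple to its (mock) KV entries.
        self._store: Dict[Tuple[int, ...], List[str]] = {}

    def lookup(self, tokens: List[int]) -> Tuple[List[str], int]:
        """Return KV for the longest cached prefix and that prefix's length."""
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self._store:
                return list(self._store[key]), end
        return [], 0

    def insert(self, tokens: List[int], kv: List[str]) -> None:
        self._store[tuple(tokens)] = list(kv)


def compute_kv(token: int) -> str:
    # Stand-in for the expensive per-token attention KV computation.
    return f"kv({token})"


def prefill(cache: PrefixKVCache, tokens: List[int]) -> Tuple[List[str], int]:
    """Prefill a request, reusing cached KV; returns (kv, tokens recomputed)."""
    kv, hit_len = cache.lookup(tokens)
    recomputed = 0
    for tok in tokens[hit_len:]:  # only compute KV past the cached prefix
        kv.append(compute_kv(tok))
        recomputed += 1
    cache.insert(tokens, kv)  # persist for future requests
    return kv, recomputed
```

With this sketch, a second request extending a cached prompt by one token recomputes KV for only that one token; production systems add eviction, paging, and multi-tier (GPU/CPU/disk) persistence on top of the same lookup-by-prefix idea.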
Alternatives and similar repositories for unified-cache-management
Users interested in unified-cache-management are comparing it to the repositories listed below.
- Efficient and easy multi-instance LLM serving ☆510 · Updated 2 months ago
- GLake: optimizing GPU memory management and IO transmission. ☆489 · Updated 7 months ago
- KV cache store for distributed LLM inference ☆361 · Updated last week
- High-performance Transformer implementation in C++. ☆142 · Updated 10 months ago
- Disaggregated serving system for Large Language Models (LLMs). ☆729 · Updated 7 months ago
- DeepSeek-V3/R1 inference performance simulator ☆168 · Updated 7 months ago
- Stateful LLM Serving ☆88 · Updated 8 months ago
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆115 · Updated 6 months ago
- Analyze the inference of Large Language Models (LLMs), covering computation, storage, transmission, and hardware roofline mod… ☆578 · Updated last year
- Materials for learning SGLang ☆650 · Updated this week
- ☆150 · Updated 5 months ago
- ☆316 · Updated last week
- Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond ☆661 · Updated 2 weeks ago
- Since the emergence of ChatGPT in 2022, accelerating Large Language Models has become increasingly important. Here is a list of pap… ☆279 · Updated 8 months ago
- ☆72 · Updated last year
- NVIDIA Inference Xfer Library (NIXL) ☆721 · Updated this week
- SGLang kernel library for NPU ☆73 · Updated this week
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. ☆483 · Updated this week
- FlagScale is a large model toolkit based on open-source projects. ☆407 · Updated last week
- ☆124 · Updated last year
- A low-latency & high-throughput serving engine for LLMs ☆445 · Updated last month
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆436 · Updated 5 months ago
- ☆90 · Updated 7 months ago
- SpotServe: Serving Generative Large Language Models on Preemptible Instances ☆132 · Updated last year
- ☆75 · Updated last year
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆119 · Updated last year
- Omni_Infer is a suite of inference accelerators designed for the Ascend NPU platform, offering native support and an expanding feature se… ☆86 · Updated last week
- ☆79 · Updated last month
- ☆512 · Updated 2 months ago
- PyTorch distributed training acceleration framework ☆53 · Updated 3 months ago