ModelEngine-Group / unified-cache-managementLinks

Persist and reuse KV Cache to speedup your LLM.

☆132

Alternatives and similar repositories for unified-cache-management

Users that are interested in unified-cache-management are comparing it to the libraries listed below

Sorting:

AlibabaPAI / llumnix
Efficient and easy multi-instance LLM serving
☆510Updated 2 months ago
antgroup / glake
GLake: optimizing GPU memory management and IO transmission.
☆489Updated 7 months ago
bytedance / InfiniStore
KV cache store for distributed LLM inference
☆361Updated last week
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆142Updated 10 months ago
LLMServe / DistServe
Disaggregated serving system for Large Language Models (LLMs).
☆729Updated 7 months ago
zartbot / shallowsim
DeepSeek-V3/R1 inference performance simulator
☆168Updated 7 months ago
WukLab / preble
Stateful LLM Serving
☆88Updated 8 months ago
infinigence / Semi-PD
A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.
☆115Updated 6 months ago
hahnyuan / LLM-Viewer
Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod…
☆578Updated last year
sgl-project / sgl-learning-materials
Materials for learning SGLang
☆650Updated this week
sihyeong / Awesome-LLM-Inference-Engine
☆150Updated 5 months ago
stepfun-ai / StepMesh
☆316Updated last week
ovg-project / kvcached
Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond
☆661Updated 2 weeks ago
galeselee / Awesome_LLM_System-PaperList
Since the emergence of chatGPT in 2022, the acceleration of Large Language Model has become increasingly important. Here is a list of pap…
☆279Updated 8 months ago
DeepLink-org / deeplink.framework
☆72Updated last year
ai-dynamo / nixl
NVIDIA Inference Xfer Library (NIXL)
☆721Updated this week
sgl-project / sgl-kernel-npu
SGLang kernel library for NPU
☆73Updated this week
sgl-project / SpecForge
Train speculative decoding models effortlessly and port them smoothly to SGLang serving.
☆483Updated this week
flagos-ai / FlagScale
FlagScale is a large model toolkit based on open-sourced projects.
☆407Updated last week
LoongServe / LoongServe
☆124Updated last year
microsoft / sarathi-serve
A low-latency & high-throughput serving engine for LLMs
☆445Updated last month
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆436Updated 5 months ago
shenh10 / DeepSeek_Simulator
☆90Updated 7 months ago
Hsword / SpotServe
SpotServe: Serving Generative Large Language Models on Preemptible Instances
☆132Updated last year
DeepLink-org / DIOPI
☆75Updated last year
feifeibear / LLMRoofline
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
☆119Updated last year
omni-ai-npu / omni-infer
Omni_Infer is a suite of inference accelerators designed for the Ascend NPU platform, offering native support and an expanding feature se…
☆86Updated last week
hao-ai-lab / MuxServe
☆79Updated last month
Tencent / KsanaLLM
☆512Updated 2 months ago
AlibabaPAI / torchacc
PyTorch distributed training acceleration framework
☆53Updated 3 months ago