ruipeterpan / marconiLinks

Artifact for "Marconi: Prefix Caching for the Era of Hybrid LLMs" [MLSys '25 Outstanding Paper Award, Honorable Mention]

☆47

Alternatives and similar repositories for marconi

Users that are interested in marconi are comparing it to the libraries listed below

Sorting:

hao-ai-lab / MuxServe
☆79Updated last month
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆108Updated 8 months ago
INT-FlashAttention2024 / INT-FlashAttention
☆83Updated 10 months ago
PKU-SEC-Lab / HybriMoE
[DAC'25] Official implement of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference"
☆93Updated 5 months ago
hao-ai-lab / vllm-ltr
[NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank
☆65Updated last year
osayamenja / FlashMoE
Distributed MoE in a Single Kernel [NeurIPS '25]
☆149Updated last week
DeepLink-org / DLSlime
DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit
☆82Updated this week
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆146Updated 2 months ago
WukLab / preble
Stateful LLM Serving
☆89Updated 8 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆73Updated 7 months ago
NEO-MLSys25 / NEO
NEO is a LLM inference engine built to save the GPU memory crisis by CPU offloading
☆69Updated 5 months ago
mit-han-lab / Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆356Updated 4 months ago
LoongServe / LoongServe
☆124Updated last year
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆142Updated 10 months ago
microsoft / chunk-attention
☆82Updated 7 months ago
microsoft / RetrievalAttention
Scalable long-context LLM decoding that leverages sparsity—by treating the KV cache as a vector storage system.
☆104Updated 2 months ago
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆59Updated 8 months ago
snu-comparch / InfiniGen
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI'24)
☆164Updated last year
microsoft / AttentionEngine
☆113Updated 6 months ago
ByteDance-Seed / cudaLLM
☆125Updated 3 months ago
d-matrix-ai / keyformer-llm
☆58Updated last year
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆190Updated last month
YaoJiayi / CacheBlend
☆153Updated 4 months ago
DD-DuDa / BitDecoding
[HPCA 2026] A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.
☆66Updated 2 weeks ago
DerrickYLJ / TidalDecode
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆49Updated 4 months ago
flexflow / flexflow-serve
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
☆64Updated 2 months ago
PanZaifeng / FastTree-Artifact
☆27Updated 8 months ago
flashinfer-ai / cutlass-viz
☆65Updated 7 months ago
meta-pytorch / KernelAgent
Autonomous GPU Kernel Generation via Deep Agents
☆172Updated this week
nex-agi / NexVenusCL
Nex Venus Communication Library
☆61Updated 3 weeks ago