WukLab / InferCeptLinks

☆30

Alternatives and similar repositories for InferCept

Users that are interested in InferCept are comparing it to the libraries listed below

Sorting:

microsoft / RetrievalAttention
Scalable long-context LLM decoding that leverages sparsity—by treating the KV cache as a vector storage system.
☆94Updated last month
WukLab / preble
Stateful LLM Serving
☆87Updated 7 months ago
hao-ai-lab / vllm-ltr
[NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank
☆61Updated 11 months ago
HugoZHL / PQCache
[SIGMOD 2025] PQCache: Product Quantization-based KVCache for Long Context LLM Inference
☆73Updated this week
hao-ai-lab / MuxServe
☆74Updated last week
alibaba / ServeGen
A framework for generating realistic LLM serving workloads
☆71Updated 2 weeks ago
YaoJiayi / CacheBlend
☆144Updated 3 months ago
SymbioticLab / Oobleck
A resilient distributed training framework
☆96Updated last year
microsoft / ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
☆188Updated last year
Hsword / SpotServe
SpotServe: Serving Generative Large Language Models on Preemptible Instances
☆130Updated last year
LoongServe / LoongServe
☆124Updated 11 months ago
dywsjtu / apparate
Artifact for "Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving" [SOSP '24]
☆25Updated 11 months ago
eddiegaoo / Apt-Serve
☆18Updated 4 months ago
kvcache-ai / TrEnv-X
☆63Updated last month
Thesys-lab / Helix-ASPLOS25
Open-source implementation for "Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow"
☆68Updated 2 weeks ago
NetX-lab / Ayo
[ASPLOS'25] Towards End-to-End Optimization of LLM-based Applications with Ayo
☆47Updated 2 months ago
casys-kaist / EnvPipe
☆25Updated 2 years ago
NEO-MLSys25 / NEO
NEO is a LLM inference engine built to save the GPU memory crisis by CPU offloading
☆67Updated 4 months ago
S-Lab-System-Group / Hydro
Surrogate-based Hyperparameter Tuning System
☆27Updated 2 years ago
DerrickYLJ / TidalDecode
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆48Updated 2 months ago
Ying1123 / VTC-artifact
☆35Updated last year
James-QiuHaoran / LLM-serving-with-proxy-models
Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction | A tiny BERT model can tell you the verbosity of an …
☆47Updated last year
Multi-LLM / prism-research
Research prototype of PRISM — a cost-efficient multi-LLM serving system with flexible time- and space-based GPU sharing.
☆31Updated 2 months ago
ByteDance-Seed / StragglerAnalysis
☆42Updated 5 months ago
MaoZiming / papers
Paper-reading notes for Berkeley OS prelim exam.
☆14Updated last year
fpgasystems / Chameleon-RAG-Acceleration
☆19Updated 4 months ago
tonyzhao-jt / LLM-PQ
Official Repo for "SplitQuant / LLM-PQ: Resource-Efficient LLM Offline Serving on Heterogeneous GPUs via Phase-Aware Model Partition and …
☆34Updated last month
zhengzangw / Sequence-Scheduling
PyTorch implementation of paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline".
☆92Updated 2 years ago
UChi-JCL / CacheGen
☆136Updated last year
awslabs / optimizing-multitask-training-through-dynamic-pipelines
Official repository for the paper DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines
☆20Updated last year