microsoft / apex_plus

APEX+ is an LLM Serving Simulator

☆41

Alternatives and similar repositories for apex_plus

Users who are interested in apex_plus are comparing it to the repositories listed below.

WukLab / preble
Stateful LLM Serving
☆93 Updated 10 months ago
gty111 / gLLM
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
☆52 Updated this week
NEO-MLSys25 / NEO
NEO is an LLM inference engine built to ease the GPU memory crisis through CPU offloading
☆77 Updated 6 months ago
hao-ai-lab / MuxServe
☆81 Updated 2 months ago
Thesys-lab / Helix-ASPLOS25
Open-source implementation for "Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow"
☆75 Updated 2 months ago
sspec-project / SparseSpec
Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
☆74 Updated last month
LoongServe / LoongServe
☆130 Updated last year
Hsword / SpotServe
SpotServe: Serving Generative Large Language Models on Preemptible Instances
☆134 Updated last year
Multi-LLM / prism-research
Research prototype of PRISM — a cost-efficient multi-LLM serving system with flexible time- and space-based GPU sharing.
☆51 Updated 4 months ago
kvcache-ai / TrEnv-X
☆72 Updated 3 months ago
hao-ai-lab / vllm-ltr
[NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank
☆66 Updated last year
YaoJiayi / CacheBlend
☆160 Updated 5 months ago
xinhao-luo / ClusterFusion
[NeurIPS 2025] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive
☆60 Updated last month
microsoft / RetrievalAttention
Scalable long-context LLM decoding that leverages sparsity by treating the KV cache as a vector storage system.
☆110 Updated last week
alibaba / llm-scheduling-artifact
Artifact of the OSDI '24 paper "Llumnix: Dynamic Scheduling for Large Language Model Serving"
☆64 Updated last year
dywsjtu / apparate
Artifact for "Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving" [SOSP '24]
☆25 Updated last year
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆148 Updated 11 months ago
Relaxed-System-Lab / HexGen
[ICML 2024] Serving LLMs on heterogeneous decentralized clusters.
☆34 Updated last year
thustorage / Medusa
Medusa: Accelerating Serverless LLM Inference with Materialization [ASPLOS'25]
☆40 Updated 7 months ago
google / iopddl
Supplemental materials for the ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning
☆24 Updated 7 months ago
hongzhangblaze / CS854-F24
☆54 Updated 3 months ago
microsoft / tokenweave
Efficient Compute-Communication Overlap for Distributed LLM Inference
☆67 Updated 2 months ago
eddiegaoo / Apt-Serve
☆19 Updated 7 months ago
Hsword / Awesome-Machine-Learning-System-Papers
☆79 Updated 3 years ago
Raphael-Hao / brainstorm
Compiler for Dynamic Neural Networks
☆46 Updated 2 years ago
infinigence / Semi-PD
A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.
☆123 Updated 2 weeks ago
ByteDance-Seed / StragglerAnalysis
☆49 Updated 8 months ago
JF-D / Parcae
☆21 Updated last year
flashinfer-ai / flashinfer-bench
Building the Virtuous Cycle for AI-driven LLM Systems
☆112 Updated this week
microsoft / ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
☆206 Updated last year