hao-ai-lab / MuxServe

☆85

Alternatives and similar repositories for MuxServe

Users who are interested in MuxServe are comparing it to the libraries listed below.

Hsword / SpotServe
SpotServe: Serving Generative Large Language Models on Preemptible Instances
☆135 · Feb 22, 2024 · Updated last year
LoongServe / LoongServe
☆131 · Nov 11, 2024 · Updated last year
LLMServe / DistServe
Disaggregated serving system for Large Language Models (LLMs).
☆776 · Apr 6, 2025 · Updated 10 months ago
hao-ai-lab / vllm-ltr
[NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank
☆70 · Nov 4, 2024 · Updated last year
microsoft / sarathi-serve
A low-latency & high-throughput serving engine for LLMs
☆474 · Jan 8, 2026 · Updated last month
microsoft / vidur
A large-scale simulation framework for LLM inference
☆535 · Jul 25, 2025 · Updated 6 months ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆142 · Dec 4, 2024 · Updated last year
project-etalon / etalon
LLM Serving Performance Evaluation Harness
☆83 · Feb 25, 2025 · Updated 11 months ago
EfficientLLMSys / MuxServe
☆15 · Jun 26, 2024 · Updated last year
UChi-JCL / CacheGen
☆151 · Oct 9, 2024 · Updated last year
ServerlessLLM / ServerlessLLM
Serverless LLM Serving for Everyone.
☆649 · Jan 23, 2026 · Updated 3 weeks ago
snu-comparch / InfiniGen
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI'24)
☆174 · Jul 10, 2024 · Updated last year
efeslab / Nanoflow
A throughput-oriented high-performance serving framework for LLMs
☆945 · Oct 29, 2025 · Updated 3 months ago
microsoft / ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
☆209 · Sep 21, 2024 · Updated last year
tonyzhao-jt / LLM-PQ
Official Repo for "SplitQuant / LLM-PQ: Resource-Efficient LLM Offline Serving on Heterogeneous GPUs via Phase-Aware Model Partition and …
☆36 · Aug 29, 2025 · Updated 5 months ago
pkusys / ElasticFlow
Artifacts for our ASPLOS'23 paper ElasticFlow
☆55 · May 10, 2024 · Updated last year
interestingLSY / swiftLLM
A tiny yet powerful LLM inference system tailored for researching purpose. vLLM-equivalent performance with only 2k lines of code (2% of …
☆314 · Jun 10, 2025 · Updated 8 months ago
AlibabaPAI / llumnix
Efficient and easy multi-instance LLM serving
☆527 · Sep 3, 2025 · Updated 5 months ago
linxihui / dkernel
☆21 · Apr 17, 2025 · Updated 9 months ago
EfficientMoE / MoE-Infinity
PyTorch library for cost-effective, fast and easy serving of MoE models.
☆280 · Feb 2, 2026 · Updated 2 weeks ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆44 · Nov 22, 2024 · Updated last year
microsoft / glinthawk
An LLM inference engine, written in C++
☆18 · Feb 5, 2026 · Updated last week
linhongseba / MaximumClique
The implementation for maximum clique enumeration algorithm
☆11 · Apr 14, 2016 · Updated 9 years ago
uwsampl / paper-agents
☆13 · Dec 9, 2024 · Updated last year
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆110 · Mar 26, 2025 · Updated 10 months ago
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆458 · May 30, 2025 · Updated 8 months ago
eth-easl / orion
An interference-aware scheduler for fine-grained GPU sharing
☆159 · Nov 26, 2025 · Updated 2 months ago
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆812 · Mar 6, 2025 · Updated 11 months ago
cchan / nanoGPT-fp8
☆13 · May 8, 2023 · Updated 2 years ago
tensorchord / llmspec
OpenAI compatible API for open source LLMs
☆16 · Oct 30, 2023 · Updated 2 years ago
apple / parameterized-transforms
torchvision-based transforms that provide access to parameterization
☆16 · Dec 4, 2025 · Updated 2 months ago
kyutai-labs / jax-flash-attn3
JAX bindings for the flash-attention3 kernels
☆20 · Jan 2, 2026 · Updated last month
LLMServe / dLoRA-artifact
☆29 · May 28, 2024 · Updated last year
Blaok / fpga-runtime
☆13 · Aug 1, 2024 · Updated last year
MachineLearningSystem / 25ASPLOS-Medusa
Medusa: Accelerating Serverless LLM Inference with Materialization [ASPLOS'25]
☆12 · Nov 8, 2024 · Updated last year
fw-ai / llama-cuda-graph-example
Example of applying CUDA graphs to LLaMA-v2
☆12 · Aug 25, 2023 · Updated 2 years ago
hao-ai-lab / Consistency_LLM
[ICML 2024] CLLMs: Consistency Large Language Models
☆411 · Nov 16, 2024 · Updated last year
Thesys-lab / Helix-ASPLOS25
Open-source implementation for "Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow"
☆76 · Oct 15, 2025 · Updated 4 months ago
pkusys / TGS
Artifacts for our NSDI'23 paper TGS
☆96 · Jun 10, 2024 · Updated last year
