flexflow / flexflow-serveLinks

FlexFlow Serve: Low-Latency, High-Performance LLM Serving

☆48

Alternatives and similar repositories for flexflow-serve

Users that are interested in flexflow-serve are comparing it to the libraries listed below

Sorting:

infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆154Updated last month
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆128Updated 6 months ago
hao-ai-lab / MuxServe
☆65Updated last year
WukLab / preble
Stateful LLM Serving
☆77Updated 4 months ago
LoongServe / LoongServe
☆109Updated 8 months ago
microsoft / nnscaler
nnScaler: Compiling DNN models for Parallel Training
☆114Updated this week
flashinfer-ai / cutlass-viz
☆60Updated 3 months ago
infinigence / Semi-PD
A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.
☆103Updated 2 months ago
microsoft / ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
☆170Updated 10 months ago
alibaba / llm-scheduling-artifact
Artifact of OSDI '24 paper, ”Llumnix: Dynamic Scheduling for Large Language Model Serving“
☆62Updated last year
DD-DuDa / BitDecoding
A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.
☆56Updated last week
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆183Updated 6 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆70Updated 2 months ago
shenh10 / DeepSeek_Simulator
☆81Updated 4 months ago
EfficientMoE / MoE-Infinity
PyTorch library for cost-effective, fast and easy serving of MoE models.
☆215Updated 3 weeks ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆91Updated 2 months ago
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆216Updated last year
CalebDu / Awesome-Cute
☆89Updated 2 months ago
AlibabaPAI / FLASHNN
☆96Updated 10 months ago
Infrawaves / DeepEP_ibrc_dual-ports_multiQP
Aims to implement dual-port and multi-qp solutions in deepEP ibrc transport
☆57Updated 2 months ago
zartbot / shallowsim
DeepSeek-V3/R1 inference performance simulator
☆156Updated 4 months ago
alibaba / easydist
Automated Parallelization System and Infrastructure for Multiple Ecosystems
☆79Updated 8 months ago
ovg-project / kvcached
kvcached: Elastic KV cache for dynamic GPU sharing and efficient multi-LLM inference.
☆33Updated this week
yifuwang / symm-mem-recipes
☆101Updated 7 months ago
Hsword / SpotServe
SpotServe: Serving Generative Large Language Models on Preemptible Instances
☆125Updated last year
PKU-SEC-Lab / HybriMoE
[DAC'25] Official implement of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference"
☆62Updated last month
microsoft / SparTA
☆150Updated last year
gty111 / gLLM
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
☆36Updated this week
ppl-ai / pplx-kernels
Perplexity GPU Kernels
☆413Updated 2 weeks ago
HPMLL / NVIDIA-Hopper-Benchmark
☆50Updated 2 months ago