jd-opensource / xllm-service
A flexible serving framework that delivers efficient and fault-tolerant LLM inference for clustered deployments.
☆86 · Updated 2 weeks ago
Alternatives and similar repositories for xllm-service
Users interested in xllm-service are comparing it to the libraries listed below.
- FlagCX is a scalable and adaptive cross-chip communication library. ☆172 · Updated this week
- ☆34 · Updated last year
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆123 · Updated last month
- A GPU-driven system framework for scalable AI applications ☆124 · Updated last year
- DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit ☆92 · Updated 2 weeks ago
- Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport ☆73 · Updated 9 months ago
- PyTorch distributed training acceleration framework ☆55 · Updated 5 months ago
- ☆152 · Updated last year
- ☆96 · Updated 10 months ago
- Fast and memory-efficient exact attention ☆114 · Updated this week
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver… ☆298 · Updated 3 weeks ago
- High Performance LLM Inference Operator Library ☆695 · Updated last week
- ☆47 · Updated last year
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆120 · Updated last year
- High performance Transformer implementation in C++. ☆151 · Updated last year
- ☆27 · Updated last year
- ☆130 · Updated last year
- ☆71 · Updated 10 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆226 · Updated 3 weeks ago
- An unofficial CUDA assembler, for all generations of SASS, hopefully :) ☆84 · Updated 2 years ago
- Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend ☆107 · Updated this week
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆44 · Updated 11 months ago
- ☆155 · Updated 11 months ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆114 · Updated last year
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆71 · Updated 4 months ago
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA cores for the decoding stage of LLM inference. ☆46 · Updated 8 months ago
- ☆26 · Updated 11 months ago
- ☆141 · Updated last year
- Compiler Infrastructure for Neural Networks ☆147 · Updated 2 years ago
- ☆23 · Updated this week