jd-opensource / xllm-service
A flexible serving framework that delivers efficient and fault-tolerant LLM inference for clustered deployments.
☆56 · Updated 2 weeks ago
Alternatives and similar repositories for xllm-service
Users interested in xllm-service are comparing it to the libraries listed below.
- This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding co…☆216 · Updated 2 months ago
- [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral…☆64 · Updated last year
- Official implementation of ICML 2024 paper "ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking".☆48 · Updated last year
- A high-performance distributed deep learning system targeting large-scale and automated distributed training.☆323 · Updated 2 months ago
- nnScaler: Compiling DNN models for Parallel Training☆118 · Updated 2 weeks ago
- PyTorch bindings for CUTLASS grouped GEMM.☆151 · Updated this week
- FlagScale is a large-model toolkit built on open-source projects.☆358 · Updated last week
- Artifact evaluation (AE) for USENIX ATC '23.☆47 · Updated 2 years ago
- Ongoing research training transformer models at scale☆18 · Updated last week
- Disaggregated serving system for Large Language Models (LLMs).☆700 · Updated 6 months ago
- High-performance Transformer implementation in C++.☆135 · Updated 8 months ago
- [SIGMOD 2025] PQCache: Product Quantization-based KVCache for Long Context LLM Inference☆71 · Updated last week
- 16-fold memory access reduction with nearly no loss☆105 · Updated 6 months ago
- ByteCheckpoint: A Unified Checkpointing Library for LFMs☆249 · Updated 3 months ago
- [ICML 2024] Serving LLMs on heterogeneous decentralized clusters.☆30 · Updated last year
- Zero Bubble Pipeline Parallelism☆429 · Updated 5 months ago
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank☆59 · Updated 11 months ago
- Pipeline Parallelism Emulation and Visualization☆67 · Updated 4 months ago
- A simple calculation for LLM MFU (Model FLOPs Utilization); see the sketch after this list.☆46 · Updated last month
- PyTorch implementation of the paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline".☆90 · Updated 2 years ago
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving.☆417 · Updated this week
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.☆111 · Updated 4 months ago
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI '24)☆154 · Updated last year
- [DAC '25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference"☆72 · Updated 4 months ago
- A lightweight design for computation-communication overlap.☆179 · Updated 3 weeks ago
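On the MFU entry above: MFU (Model FLOPs Utilization) is the ratio of the FLOPs a training run actually achieves to the hardware's theoretical peak. Below is a minimal sketch of the standard estimate, assuming the common ~6N FLOPs-per-token approximation for decoder-only training; the function name and parameters here are illustrative, not that repository's API.

```python
def training_mfu(n_params: float, tokens_per_sec: float,
                 num_gpus: int, peak_tflops_per_gpu: float) -> float:
    """Estimate Model FLOPs Utilization (MFU) for decoder-only LLM training.

    Uses the common ~6 * N FLOPs-per-token approximation (forward + backward),
    ignoring the attention term. All names here are illustrative.
    """
    achieved_flops = 6.0 * n_params * tokens_per_sec    # FLOPs/s actually performed
    peak_flops = num_gpus * peak_tflops_per_gpu * 1e12  # hardware ceiling in FLOPs/s
    return achieved_flops / peak_flops

# Example: a 7B model at 24,000 tokens/s on 8 GPUs rated at 312 TFLOPS (BF16) each
print(f"MFU ≈ {training_mfu(7e9, 24_000, 8, 312):.1%}")  # ≈ 40.4%
```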