vllm-project / production-stack
vLLM’s reference system for K8S-native cluster-wide deployment with community-driven performance optimization
☆877 · Updated last week
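production-stack fronts a cluster of vLLM engines with a router that serves vLLM's OpenAI-compatible API. As an illustrative smoke test only (the port, endpoint URL, and model name below are placeholder assumptions, not values from this listing), a deployment could be exercised like this:

```python
# Minimal smoke test for a production-stack deployment.
# Assumptions (not taken from this listing): the router service has been
# port-forwarded to localhost:30080 and exposes vLLM's OpenAI-compatible
# API; the model name is a placeholder for whatever the stack deploys.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30080/v1",  # assumed port-forwarded router
    api_key="EMPTY",                       # vLLM ignores the key by default
)

# List the models the stack is currently serving.
for model in client.models.list():
    print(model.id)

# Send one request through the router to verify end-to-end serving.
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```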
Alternatives and similar repositories for production-stack:
Users interested in production-stack are comparing it to the libraries listed below.
- Redis for LLMs ☆653 · Updated this week
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆1,113 · Updated this week
- LLMPerf is a library for validating and benchmarking LLMs (see the latency-probe sketch after this list) ☆835 · Updated 3 months ago
- A throughput-oriented high-performance serving framework for LLMs ☆782 · Updated 6 months ago
- FlashInfer: Kernel Library for LLM Serving ☆2,483 · Updated this week
- Fast, Flexible and Portable Structured Generation ☆818 · Updated this week
- Serverless LLM Serving for Everyone. ☆441 · Updated this week
- [NeurIPS'24 Spotlight, ICLR'25] Speeds up long-context LLM inference by approximately and dynamically computing sparse attention, which r… ☆945 · Updated last month
- Efficient and easy multi-instance LLM serving ☆348 · Updated this week
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs ☆229 · Updated this week
- Materials for learning SGLang ☆355 · Updated this week
- A Datacenter Scale Distributed Inference Serving Framework ☆3,122 · Updated this week
- Large Language Model (LLM) Systems Paper List ☆829 · Updated this week
- My learning notes/codes for ML SYS. ☆1,545 · Updated this week
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆1,336 · Updated this week
- Cost-efficient and pluggable Infrastructure components for GenAI inference ☆3,290 · Updated this week
- LLM KV cache compression made easy ☆442 · Updated last week
- The Triton TensorRT-LLM Backend ☆809 · Updated last week
- Efficient LLM Inference over Long Sequences ☆365 · Updated last month
- Minimalistic large language model 3D-parallelism training ☆1,715 · Updated this week
- MoBA: Mixture of Block Attention for Long-Context LLMs ☆1,687 · Updated 3 weeks ago
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3. ☆1,094 · Updated this week
- An Open Source Toolkit For LLM Distillation ☆554 · Updated 2 months ago
- A low-latency & high-throughput serving engine for LLMs ☆327 · Updated last month
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆617 · Updated 3 weeks ago
- A PyTorch Native LLM Training Framework ☆759 · Updated 3 months ago
- VPTQ, A Flexible and Extreme low-bit quantization algorithm ☆617 · Updated last week
- Disaggregated serving system for Large Language Models (LLMs). ☆507 · Updated 7 months ago
- 📖 A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉 ☆3,700 · Updated 3 weeks ago
- Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. ☆2,915 · Updated this week
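Several entries above (LLMPerf and the deployment-evaluation toolkits) exist to measure serving performance. As a toy sketch of the kind of measurement those tools automate with far more rigor (concurrency sweeps, token-level timing, and so on), assuming an OpenAI-compatible endpoint at a placeholder URL:

```python
# Toy latency probe against an OpenAI-compatible serving endpoint.
# The base_url and model name are placeholder assumptions; real tools
# like LLMPerf measure much more than wall-clock request latency.
import statistics
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30080/v1", api_key="EMPTY")

latencies = []
for _ in range(10):
    start = time.perf_counter()
    client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
        messages=[{"role": "user", "content": "Count to five."}],
        max_tokens=16,
    )
    latencies.append(time.perf_counter() - start)

print(f"p50 latency:  {statistics.median(latencies):.3f}s")
print(f"mean latency: {statistics.mean(latencies):.3f}s")
```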