antgroup / sglang
SGLang is a fast serving framework for large language models and vision language models.
☆18 · Updated this week
Alternatives and similar repositories for sglang
Users interested in sglang are comparing it to the libraries listed below.
- KV cache store for distributed LLM inference ☆345 · Updated last month
- Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond ☆104 · Updated this week
- Materials for learning SGLang ☆615 · Updated 3 weeks ago
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation ☆113 · Updated 5 months ago
- Efficient and easy multi-instance LLM serving ☆497 · Updated last month
- ☆91 · Updated last week
- The driver for LMCache core to run in vLLM ☆54 · Updated 8 months ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks (see the roofline sketch after this list) ☆115 · Updated last year
- ☆507 · Updated last month
- An annotated nano_vllm repository, adapted for MiniCPM4 and extended with support for registering new models ☆81 · Updated 2 months ago
- ☆56 · Updated 11 months ago
- SGLang kernel library for NPU ☆64 · Updated this week
- ☆307 · Updated 3 weeks ago
- ☆148 · Updated 7 months ago
- GLake: optimizing GPU memory management and IO transmission ☆483 · Updated 7 months ago
- PyTorch distributed training acceleration framework ☆53 · Updated 2 months ago
- Fast and memory-efficient exact attention ☆96 · Updated this week
- ☆96 · Updated 6 months ago
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving ☆439 · Updated this week
- A llama model inference framework implemented in CUDA C++ ☆62 · Updated 11 months ago
- Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv… ☆220 · Updated last week
- DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments ☆62 · Updated 2 weeks ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆199 · Updated 2 weeks ago
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … ☆278 · Updated 4 months ago
- OME is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs) ☆292 · Updated last week
- Offline optimization of your disaggregated Dynamo graph ☆79 · Updated this week
- Stateful LLM Serving ☆87 · Updated 7 months ago
- Disaggregated serving system for Large Language Models (LLMs) ☆706 · Updated 6 months ago
- ☆75 · Updated 11 months ago
- NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer ☆138 · Updated last month
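For context on the Roofline Model entry above: the model bounds attainable throughput by min(peak compute, memory bandwidth × arithmetic intensity). Below is a minimal sketch of that bound; the peak-FLOP/s, bandwidth, and intensity figures are illustrative assumptions, not specs of any particular platform or measurements from the linked repository.

```python
# Minimal roofline sketch: attainable FLOP/s = min(peak, bandwidth * intensity).
# All hardware numbers below are illustrative assumptions, not vendor specs.

def roofline(peak_flops: float, mem_bw: float, intensity: float) -> float:
    """Attainable FLOP/s at a given arithmetic intensity (FLOPs per byte)."""
    return min(peak_flops, mem_bw * intensity)

PEAK = 300e12  # hypothetical 300 TFLOP/s peak compute
BW = 2e12      # hypothetical 2 TB/s memory bandwidth

# Batch-1 decode is GEMV-like: ~2 FLOPs per fp16 weight (2 bytes) -> ~1 FLOP/byte.
# Prefill is GEMM-like and far denser; 300 FLOPs/byte is a stand-in value.
for name, intensity in [("decode (GEMV-like)", 1.0), ("prefill (GEMM-like)", 300.0)]:
    attainable = roofline(PEAK, BW, intensity)
    bound = "memory-bound" if attainable < PEAK else "compute-bound"
    print(f"{name}: {attainable / 1e12:.1f} TFLOP/s attainable, {bound}")
```

Comparing two platforms then amounts to overlaying their rooflines and reading off which one is higher at the intensity of the workload of interest.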