baidu / vLLM-Kunlun
vLLM Kunlun (vllm-kunlun) is a community-maintained hardware plugin designed to seamlessly run vLLM on the Kunlun XPU.
☆158 · Updated this week
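For context, a hardware plugin like this is normally consumed through vLLM's standard Python API. The sketch below assumes vllm-kunlun registers itself via Python entry points so that installing it next to vLLM enables device auto-detection (mirroring the vLLM-on-Ascend plugin listed below); the model name is a placeholder, and the import guard keeps the sketch runnable on machines without vLLM or a Kunlun XPU.

```python
# Hedged sketch: vLLM hardware plugins typically self-register via entry
# points, so "pip install vllm vllm-kunlun" is assumed to be all the setup
# needed before using the regular vLLM API. Not an official example.
try:
    from vllm import LLM, SamplingParams  # standard vLLM offline-inference API
    vllm_available = True
except ImportError:
    vllm_available = False  # vLLM (and the plugin) not installed here

if vllm_available:
    # Placeholder model name; any model supported on the target device works.
    llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
    params = SamplingParams(temperature=0.0, max_tokens=16)
    outputs = llm.generate(["What is an XPU?"], params)
    print(outputs[0].outputs[0].text)
else:
    print("vLLM not available; install vllm and vllm-kunlun to run this")
```

With a platform plugin installed, no vllm-kunlun-specific code is needed at the call site; the device selection happens inside vLLM at startup.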
Alternatives and similar repositories for vLLM-Kunlun
Users interested in vLLM-Kunlun are comparing it to the repositories listed below.
- GLake: optimizing GPU memory management and IO transmission. ☆491 · Updated 8 months ago
- RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications. ☆940 · Updated last week
- Disaggregated serving system for Large Language Models (LLMs). ☆747 · Updated 8 months ago
- Byted PyTorch Distributed for Hyperscale Training of LLMs and RLs ☆896 · Updated 2 weeks ago
- A self-learning tutorial for CUDA high-performance programming. ☆775 · Updated 5 months ago
- ☆515 · Updated 3 weeks ago
- This repository organizes materials, recordings, and schedules related to AI-infra learning meetings. ☆262 · Updated last week
- Learning how CUDA works ☆350 · Updated 9 months ago
- Efficient and easy multi-instance LLM serving ☆517 · Updated 3 months ago
- How to learn PyTorch and OneFlow ☆461 · Updated last year
- Persist and reuse KV Cache to speed up your LLM. ☆158 · Updated this week
- DLRover: An Automatic Distributed Deep Learning System ☆1,602 · Updated this week
- ☆75 · Updated last year
- NVIDIA Inference Xfer Library (NIXL) ☆753 · Updated this week
- Materials for learning SGLang ☆682 · Updated last week
- FlagGems is an operator library for large language models implemented in the Triton language. ☆797 · Updated this week
- The road to hack SysML and become a systems expert ☆501 · Updated last year
- FlagScale is a large-model toolkit based on open-source projects. ☆421 · Updated last week
- ☆73 · Updated last year
- KV cache store for distributed LLM inference ☆371 · Updated last month
- Community-maintained hardware plugin for vLLM on Ascend ☆1,443 · Updated this week
- Hooked CUDA-related dynamic libraries by using automated code-generation tools. ☆172 · Updated 2 years ago
- DeepSeek-V3/R1 inference performance simulator ☆169 · Updated 8 months ago
- AI Accelerator Benchmark focuses on evaluating AI accelerators from a practical production perspective, including the ease of use and ver… ☆282 · Updated 3 months ago
- A guide to hand-writing CUDA operators and interview preparation (CUDA 算子手撕与面试指南) ☆726 · Updated 3 months ago
- A fast communication-overlapping library for tensor/expert parallelism on GPUs. ☆1,188 · Updated 3 months ago
- Distributed compiler based on Triton for parallel systems ☆1,269 · Updated this week
- ☆328 · Updated last month
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆118 · Updated 6 months ago
- ☆759 · Updated last month