LMCache / LMCache-Ascend
LMCache on Ascend
☆43 · Updated last week
Alternatives and similar repositories for LMCache-Ascend
Users interested in LMCache-Ascend are comparing it to the libraries listed below.
- GLake: optimizing GPU memory management and IO transmission. ☆494 · Updated 9 months ago
- Efficient and easy multi-instance LLM serving ☆520 · Updated 4 months ago
- Disaggregated serving system for Large Language Models (LLMs). ☆761 · Updated 9 months ago
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆123 · Updated 2 weeks ago
- SGLang kernel library for NPU ☆91 · Updated this week
- FlagCX is a scalable and adaptive cross-chip communication library. ☆138 · Updated last week
- Offline optimization of your disaggregated Dynamo graph ☆137 · Updated this week
- DeepSeek-V3/R1 inference performance simulator ☆175 · Updated 9 months ago
- NVIDIA Inference Xfer Library (NIXL) ☆788 · Updated this week
- Materials for learning SGLang ☆714 · Updated this week
- This repository organizes materials, recordings, and schedules related to AI-infra learning meetings. ☆288 · Updated this week
- KV cache store for distributed LLM inference ☆384 · Updated last month
- A low-latency & high-throughput serving engine for LLMs ☆462 · Updated 2 months ago
- ☆337 · Updated last week
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. ☆615 · Updated this week
- A self-learning tutorial for CUDA high-performance programming. ☆803 · Updated 6 months ago
- vLLM Kunlun (vllm-kunlun) is a community-maintained hardware plugin designed to seamlessly run vLLM on the Kunlun XPU. ☆212 · Updated this week
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver… ☆289 · Updated 4 months ago
- Perplexity GPU Kernels ☆548 · Updated 2 months ago
- ☆92 · Updated 9 months ago
- FlagGems is an operator library for large language models implemented in the Triton Language. ☆824 · Updated this week
- A lightweight design for computation-communication overlap. ☆207 · Updated 2 weeks ago
- Hooks CUDA-related dynamic libraries using automated code generation tools. ☆172 · Updated 2 years ago
- An annotated nano_vllm repository, with MiniCPM4 adaptation and support for registering new models. ☆134 · Updated 5 months ago
- Ascend TileLang adapter ☆177 · Updated this week
- Stateful LLM Serving ☆93 · Updated 10 months ago
- A large-scale simulation framework for LLM inference ☆509 · Updated 5 months ago
- High-performance Transformer implementation in C++. ☆148 · Updated 11 months ago
- SGLang is a fast serving framework for large language models and vision language models. ☆26 · Updated this week
- Summary of some awesome work for optimizing LLM inference ☆161 · Updated last month