LMCache / LMCache-Ascend
LMCache on Ascend
☆45, updated last week
Alternatives and similar repositories for LMCache-Ascend
Users interested in LMCache-Ascend are comparing it to the libraries listed below.
- Disaggregated serving system for Large Language Models (LLMs). (☆772, updated 9 months ago)
- Materials for learning SGLang (☆728, updated 3 weeks ago)
- GLake: optimizing GPU memory management and IO transmission. (☆497, updated 10 months ago)
- SGLang kernel library for NPU (☆95, updated last week)
- Offline optimization of your disaggregated Dynamo graph (☆168, updated this week)
- Efficient and easy multi-instance LLM serving (☆523, updated 4 months ago)
- This repository organizes materials, recordings, and schedules related to AI-infra learning meetings. (☆312, updated 3 weeks ago)
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. (☆659, updated this week)
- DeepSeek-V3/R1 inference performance simulator (☆177, updated 10 months ago)
- NVIDIA Inference Xfer Library (NIXL) (☆844, updated this week)
- KV cache store for distributed LLM inference (☆389, updated 2 months ago)
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. (☆123, updated last month)
- vLLM Kunlun (vllm-kunlun) is a community-maintained hardware plugin designed to seamlessly run vLLM on the Kunlun XPU. (☆239, updated this week)
- ☆340 (updated 3 weeks ago)
- High Performance LLM Inference Operator Library (☆222, updated last week)
- SGLang is a fast serving framework for large language models and vision language models. (☆27, updated this week)
- This repository stores personal notes and annotated papers from daily research. (☆175, updated last week)
- Curated collection of papers in machine learning systems (☆500, updated last month)
- FlagCX is a scalable and adaptive cross-chip communication library. (☆170, updated last week)
- An annotated nano_vllm repository, with a MiniCPM4 adaptation and support for registering new models (☆147, updated 5 months ago)
- FlagGems is an operator library for large language models implemented in the Triton language. (☆887, updated this week)
- A self-learning tutorial for CUDA High Performance Programming. (☆854, updated 2 weeks ago)
- Persist and reuse KV cache to speed up your LLM. (☆244, updated this week)
- Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv… (☆260, updated this week)
- A low-latency & high-throughput serving engine for LLMs (☆470, updated 3 weeks ago)
- AI Accelerator Benchmark focuses on evaluating AI accelerators from a practical production perspective, including the ease of use and ver… (☆298, updated last week)
- Ascend TileLang adapter (☆196, updated this week)
- A lightweight design for computation-communication overlap. (☆213, updated last week)
- 📚 200+ Tensor/CUDA Cores kernels, ⚡️ flash-attn-mma, ⚡️ hgemm with WMMA, MMA, and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉). (☆62, updated 9 months ago)
- High-performance Transformer implementation in C++. (☆148, updated last year)