A high-throughput and memory-efficient inference and serving engine for LLMs
☆79Mar 20, 2026Updated this week
Alternatives and similar repositories for vllm-musa
Users who are interested in vllm-musa are comparing it to the libraries listed below.
- Fork from https://github.com/deepseek-ai/FlashMLA☆16Feb 26, 2025Updated last year
- torch_musa is an open source repository based on PyTorch, which can make full use of the super computing power of MooreThreads graphics c…☆484Mar 17, 2026Updated last week
- RISCV C and Triton AI-Benchmark☆22Jan 28, 2026Updated last month
- ☆25Mar 15, 2023Updated 3 years ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks.☆118Mar 13, 2024Updated 2 years ago
- a static analytical model for LLM distributed training☆123Jan 8, 2026Updated 2 months ago
- ☆155Mar 4, 2025Updated last year
- DeepSeek-V3/R1 inference performance simulator☆189Mar 27, 2025Updated 11 months ago
- pytorch code examples for measuring the performance of collective communication calls in AI workloads☆19Sep 18, 2025Updated 6 months ago
- A Winograd Minimal Filter Implementation in CUDA☆28Aug 25, 2021Updated 4 years ago
- Run code-llama with 50k tokens using flash attention and better transformer☆12Nov 21, 2023Updated 2 years ago
- ncnn is a high-performance neural network inference framework optimized for the mobile platform☆14May 20, 2022Updated 3 years ago
- ☆32Aug 24, 2022Updated 3 years ago
- Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.☆73Sep 8, 2024Updated last year
- ☆10Feb 11, 2025Updated last year
- Aioli: A unified optimization framework for language model data mixing☆32Jan 17, 2025Updated last year
- GPU implementation of Winograd convolution☆10Oct 23, 2017Updated 8 years ago
- Development repository for the Triton-Linalg conversion☆217Feb 7, 2025Updated last year
- An IR for efficiently simulating distributed ML computation.☆32Jan 13, 2024Updated 2 years ago
- this is for intel openvino (https://software.intel.com/en-us/openvino-toolkit)☆18Oct 24, 2018Updated 7 years ago
- ☆66Feb 4, 2026Updated last month
- ☆15May 2, 2018Updated 7 years ago
- Fast GPU based tensor core reductions☆13Jan 13, 2023Updated 3 years ago
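- VastModelZOO is the AI model library maintained by the VastAI-AIS team at Vastai Technologies (瀚博半导体). It provides deployment and training examples for open-source models from multiple AI domains (CV, AUDIO, NLP, LLM, MLLM, etc.) on Vastai training and inference chips.☆28Updated this week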
- Official repository of paper "Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models"☆23May 27, 2025Updated 9 months ago
- Agentman: A tool for building and managing AI agents☆15Jul 10, 2025Updated 8 months ago
- Source code of the IPDPS '21 paper: "TileSpMV: A Tiled Algorithm for Sparse Matrix-Vector Multiplication on GPUs" by Yuyao Niu, Zhengyang…☆12Aug 12, 2022Updated 3 years ago
- Code for the paper "Match, Compare, or Select? An Investigation of Large Language Models for Entity Matching" (COLING 2025)☆19Jan 3, 2026Updated 2 months ago
- A unified programming framework for high and portable performance across FPGAs and GPUs☆11Mar 23, 2025Updated last year
- An adapter layer that ensures torch_musa🔦 delivers a CUDA-compatible PyTorch experience.☆31Updated this week
- a simple API to use CUPTI☆10Aug 19, 2025Updated 7 months ago
- An intelligent matrix format designer for SpMV☆10Oct 10, 2023Updated 2 years ago
- ☆15Mar 30, 2024Updated last year
- Mathematical expression evaluator with just in time code generation.☆12Apr 7, 2013Updated 12 years ago
- ☆20May 24, 2025Updated 10 months ago
- DiscreteTom's Blog Boilerplate.☆10Mar 6, 2023Updated 3 years ago
- ☆13Jun 23, 2022Updated 3 years ago
- A VGG19 convolutional neural network accelerated with CUDA☆12Jun 11, 2018Updated 7 years ago
- Cheng-Hao Tu, Jia-Hong Lee, Yi-Ming Chan and Chu-Song Chen, "Pruning Depthwise Separable Convolutions for MobileNet Compression," Interna…☆16Jan 8, 2021Updated 5 years ago