MooreThreads / torch_musa
torch_musa is an open-source PyTorch extension that makes full use of the computing power of MooreThreads GPUs.
☆395 · Updated last week
Alternatives and similar repositories for torch_musa
Users interested in torch_musa are comparing it to the libraries listed below.
- ☆118 · Updated last year
- Ascend PyTorch adapter (torch_npu). Mirror of https://gitee.com/ascend/pytorch ☆354 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs. ☆49 · Updated 6 months ago
- FlagGems is an operator library for large language models implemented in the Triton language. ☆528 · Updated this week
- A lightweight LLM inference framework. ☆727 · Updated last year
- Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052 ☆473 · Updated last year
- Efficient operator implementations for the Cambricon Machine Learning Unit (MLU). ☆116 · Updated this week
- PaddlePaddle custom device implementation. ☆83 · Updated this week
- Triton documentation in Simplified Chinese. ☆71 · Updated last month
- A CPU tool for benchmarking peak floating-point performance. ☆543 · Updated last week
- ☆140 · Updated 4 months ago
- The CUDA version of the RWKV language model (https://github.com/BlinkDL/RWKV-LM). ☆223 · Updated 5 months ago
- Machine learning compiler based on MLIR for Sophgo TPU. ☆719 · Updated this week
- MegCC is a deep learning model compiler with an ultra-lightweight runtime that is efficient and easy to port. ☆484 · Updated 6 months ago
- BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads. ☆864 · Updated 4 months ago
- This is an implementation of sgemm_kernel on L1d cache. ☆228 · Updated last year
- Llama 2 inference. ☆42 · Updated last year
- Yinghan's Code Sample. ☆327 · Updated 2 years ago
- llm-export can export LLM models to ONNX. ☆289 · Updated 4 months ago
- Stable Diffusion using MNN. ☆68 · Updated last year
- A text-to-image project based on the open-source Stable Diffusion V1.5 model, producing models that run on mobile-phone CPUs and NPUs, together with a companion model-inference framework. ☆188 · Updated last year
- 📚FFPA (Split-D): Extend FlashAttention with Split-D for large headdim, O(1) GPU SRAM complexity, 1.8x~3x↑🎉 faster than SDPA EA. ☆174 · Updated last week
- Run generative AI models on Sophgo BM1684X/BM1688. ☆208 · Updated last week
- A tutorial for CUDA & PyTorch. ☆140 · Updated 3 months ago
- ☆237 · Updated 3 months ago
- ☆148 · Updated 4 months ago
- [EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a V… ☆473 · Updated this week
- DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including … ☆251 · Updated this week
- Row-major matmul optimization. ☆629 · Updated last year
- ☆162 · Updated last month