matrix97317 / OneNeuralNetwork

A cross-chip collection of operators and a unified neural network library.

☆18

Alternatives and similar repositories for OneNeuralNetwork

Users who are interested in OneNeuralNetwork are comparing it to the libraries listed below.

gfvvz / triton-learning-materials
Triton Compiler related materials.
☆35Updated 10 months ago
InfiniTensor / RefactorGraph
A layered, decoupled deep learning inference engine
☆76Updated 9 months ago
MARD1NO / CUDA-PPT
☆112Updated 7 months ago
GetUpEarlier / minit
☆27Updated last year
njuhope / cuda_sgemm
☆116Updated last year
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆186Updated 9 months ago
tongzhou80 / nanoPyC
☆70Updated 2 years ago
LeiWang1999 / tvm_gpu_gemm
play gemm with tvm
☆92Updated 2 years ago
OpenPPL / CuAssembler
An unofficial CUDA assembler, for all generations of SASS, hopefully :)
☆84Updated 2 years ago
flagos-ai / flagtree
FlagTree is a unified compiler for multiple AI chips, which is forked from triton-lang/triton.
☆131Updated last week
ArthurinRUC / cutlass-notes
From Minimal GEMM to Everything
☆79Updated last week
billmuch / matmul_perf_test
☆15Updated 3 years ago
XiaoSong9905 / dgemm-knl
DGEMM on KNL, achieving 75% of MKL performance
☆18Updated 3 years ago
nicolaswilde / cuda-tensorcore-hgemm
☆156Updated 10 months ago
galois-stack / galois
A tensor computing compiler based on tile programming for GPU, CPU, or TPU
☆46Updated 2 months ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆127Updated 6 months ago
JackonYang / hands-on-tvm
Hands-on model tuning with TVM, profiled on a Mac M1, x86 CPU, and GTX-1080 GPU.
☆50Updated 2 years ago
StrongSpoon / tvm.schedule
examples for tvm schedule API
☆101Updated 2 years ago
QianyanTech / NBAssembler
Assembler and Decompiler for NVIDIA (Maxwell Pascal Volta Turing Ampere) GPUs.
☆91Updated 2 years ago
starmee / AI-Notes
My learning notes about AI, including Machine Learning and Deep Learning.
☆18Updated 6 years ago
Cambricon / mlu-ops
Efficient operation implementation based on the Cambricon Machine Learning Unit (MLU).
☆138Updated last week
weishengying / cutlass_flash_atten_fp8
FP8 flash attention implemented on the Ada architecture using the cutlass repository
☆78Updated last year
AyakaGEMM / Hands-on-GEMM
☆143Updated last year
CalvinXKY / BasicCUDA
A tutorial for CUDA & PyTorch
☆165Updated 9 months ago
DeepLink-org / DLOP-Bench
A benchmark suited especially for deep learning operators
☆42Updated 2 years ago
InfiniTensor / InfiniTensor
☆273Updated 3 weeks ago
CalebDu / Awesome-Cute
☆110Updated 6 months ago
Syencil / Programming_Massively_Parallel_Processors
Code and notes for the 6 major CUDA parallel computing patterns
☆61Updated 5 years ago
BBuf / how-to-optimize-gemm
☆98Updated 4 years ago
openmlsys / openmlsys-cuda
Tutorials for writing high-performance GPU operators in AI frameworks.
☆134Updated 2 years ago