MegEngine / cutlassLinks

CUDA Templates for Linear Algebra Subroutines

☆100

Alternatives and similar repositories for cutlass

Users that are interested in cutlass are comparing it to the libraries listed below

Sorting:

njuhope / cuda_sgemm
☆116Updated last year
BBuf / how-to-optimize-gemm
☆98Updated 4 years ago
pigirons / conv3x3_m1
This is a demo how to write a high performance convolution run on apple silicon
☆57Updated 3 years ago
MARD1NO / CUDA-PPT
☆112Updated 7 months ago
OpenPPL / CuAssembler
An unofficial cuda assembler, for all generations of SASS, hopefully ：）
☆84Updated 2 years ago
Qualcomm-AI-research / FP8-quantization
☆165Updated 2 years ago
MegEngine / mgeconvert
MegEngine到其他框架的转换器
☆70Updated 2 years ago
aojunzz / NM-sparsity
☆243Updated 3 years ago
xuqiantong / CUDA-Winograd
Fast CUDA Kernels for ResNet Inference.
☆181Updated 6 years ago
OpenPPL / ppl.kernel.cpu
☆19Updated last year
AyakaGEMM / Hands-on-GEMM
☆143Updated last year
LeiWang1999 / tvm_gpu_gemm
play gemm with tvm
☆92Updated 2 years ago
DeepLink-org / CVFusion
CVFusion is an open-source deep learning compiler to fuse the OpenCV operators.
☆32Updated 3 years ago
yester31 / Cutlass_EX
study of cutlass
☆22Updated last year
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆96Updated 2 months ago
MegEngine / examples
A set of examples around MegEngine
☆31Updated last year
weishengying / cutlass_flash_atten_fp8
使用 cutlass 仓库在 ada 架构上实现 fp8 的 flash attention
☆78Updated last year
mit-han-lab / inter-operator-scheduler
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
☆200Updated 3 years ago
nicolaswilde / cuda-tensorcore-hgemm
☆156Updated 10 months ago
OpenPPL / ppl.kernel.cuda
☆38Updated last year
ModelTC / mqbench-paper
☆44Updated 4 years ago
OpenPPL / ppl.pmx
☆60Updated last year
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆127Updated 6 months ago
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆189Updated 9 months ago
UDC-GAC / openCNN
A Winograd Minimal Filter Implementation in CUDA
☆28Updated 4 years ago
OpenPPL / ppl.llm.kernel.cuda
☆152Updated 10 months ago
ModelTC / pyvlova
Yet another Polyhedra Compiler for DeepLearning
☆19Updated 2 years ago
snap-research / F8Net
[ICLR 2022 Oral] F8Net: Fixed-Point 8-bit Only Multiplication for Network Quantization
☆94Updated 3 years ago
nicolaswilde / cuda-sgemm
☆70Updated 10 months ago
TRT2022 / MST-plus-plus-TensorRT
TensorRT 2022复赛方案：首个基于Transformer的图像重建模型MST++的TensorRT模型推断优化
☆143Updated 3 years ago