zeroine/cutlass-cute-sample

zeroine / cutlass-cute-sample

☆49

Alternatives and similar repositories for cutlass-cute-sample

Users who are interested in cutlass-cute-sample are comparing it to the repositories listed below.

DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆279 · Jul 1, 2025 · Updated last year
JJXiangJiaoJun / cutlass_gemv
GEMV implementation with CUTLASS
☆21 · Aug 21, 2025 · Updated 11 months ago
reed-lau / cute-gemm
☆186 · May 11, 2026 · Updated 2 months ago
Chtholly-Boss / swizzle
A practical way of learning Swizzle
☆42 · Feb 3, 2025 · Updated last year
CalebDu / Awesome-Cute
☆121 · May 16, 2025 · Updated last year
leimao / CUTLASS-Examples
CUTLASS and CuTe Examples
☆136 · Nov 30, 2025 · Updated 7 months ago
ColfaxResearch / cutlass-kernels
☆269 · Jul 11, 2024 · Updated 2 years ago
66RING / tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
☆527 · Jan 20, 2026 · Updated 6 months ago
weishengying / tiny-flash-attention
A simplified, educational implementation of flash-attention using cutlass
☆59 · Aug 12, 2024 · Updated last year
lcy-seso / DLFrameworkTest
My tests and experiments with some popular dl frameworks.
☆17 · Sep 11, 2025 · Updated 10 months ago
TiledTensor / TiledBench
Benchmark tests supporting the TiledCUDA library.
☆19 · Nov 19, 2024 · Updated last year
lixiuhong / implicit_gemm_convolution
☆14 · May 28, 2019 · Updated 7 years ago
Bruce-Lee-LY / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆556 · Sep 8, 2024 · Updated last year
weishengying / cutlass_flash_atten_fp8
FP8 flash attention implemented on the Ada architecture using the cutlass repository
☆82 · Aug 12, 2024 · Updated last year
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆115 · Jun 28, 2025 · Updated last year
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆47 · Jun 11, 2025 · Updated last year
Yinghan-Li / YHs_Sample
Yinghan's Code Sample
☆365 · Jul 25, 2022 · Updated 3 years ago
feifeibear / swGEMM
A highly efficient library for GEMM operations on Sunway TaihuLight
☆18 · Sep 7, 2020 · Updated 5 years ago
Harry-Chen / fp4_sm120
Make FP4 on 5090 Great Again
☆17 · Updated this week
microsoft / FractalTensor
FractalTensor is a programming framework that introduces a novel approach to organizing data in deep neural networks (DNNs) as a list of …
☆32 · Dec 21, 2024 · Updated last year
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆157 · May 10, 2025 · Updated last year
leepoly / sm-profiler
☆82 · Feb 5, 2026 · Updated 5 months ago
HuyNguyen-hust / hopper-gemm-101
☆13 · Dec 22, 2024 · Updated last year
CSshengxy / MEC
ICML2017 MEC: Memory-efficient Convolution for Deep Neural Network, C++ implementation (unofficial)
☆17 · Apr 9, 2019 · Updated 7 years ago
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆445 · Mar 5, 2026 · Updated 4 months ago
nox-410 / tvm.tl
An extension of TVMScript to write simple and high performance GPU kernels with tensorcore.
☆52 · Jul 23, 2024 · Updated last year
maxiaosong1124 / ncu-cuda-profiling-skill
Let coding agents use ncu skills to analyze CUDA programs automatically!
☆117 · May 25, 2026 · Updated last month
Cjkkkk / CUDA_gemm
A simple high performance CUDA GEMM implementation.
☆437 · Jan 4, 2024 · Updated 2 years ago
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆192 · Jan 28, 2025 · Updated last year
gxinlong / cuda-optimization-skill
A skill for automatically optimizing CUDA code.
☆42 · Mar 26, 2026 · Updated 3 months ago
sunkx109 / My-Torch-Extension
A minimalist and extensible PyTorch extension for implementing custom backend operators in PyTorch.
☆41 · Jan 24, 2026 · Updated 5 months ago
nicolaswilde / cuda-tensorcore-hgemm
☆160 · Dec 26, 2024 · Updated last year
tspeterkim / flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
☆1,169 · Dec 30, 2024 · Updated last year
cornell-brg / torng-uecgra-scripts-hpca2021
☆12 · Aug 4, 2022 · Updated 3 years ago
crispyberry / MLIR-Pass-Tour
☆11 · Feb 28, 2023 · Updated 3 years ago
ankan-ban / llama2.cu
Inference Llama 2 in one file of pure Cuda
☆17 · Aug 20, 2023 · Updated 2 years ago
yzhaiustc / Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.
☆420 · Jan 2, 2025 · Updated last year
luliyucoordinate / flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
☆12 · Jun 10, 2024 · Updated 2 years ago
HydraQYH / expert_specialization_moe
Expert Specialization MoE Solution based on CUTLASS
☆27 · Apr 14, 2026 · Updated 3 months ago