rchardx/cuda-gemm

☆42

Alternatives and similar repositories for cuda-gemm

Users who are interested in cuda-gemm are comparing it to the libraries listed below.

TiledTensor / TiledBench
Benchmark tests supporting the TiledCUDA library.
☆18 · Nov 19, 2024 · Updated last year
TiledTensor / TiledLower
TiledLower is a Dataflow Analysis and Codegen Framework written in Rust.
☆14 · Nov 23, 2024 · Updated last year
pzhao-eng / FlashMLA
☆62 · Feb 15, 2026 · Updated 2 weeks ago
lcy-seso / DLFrameworkTest
My tests and experiments with some popular dl frameworks.
☆17 · Sep 11, 2025 · Updated 5 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆74 · May 5, 2025 · Updated 10 months ago
tile-ai / tvm
Open deep learning compiler stack for cpu, gpu and specialized accelerators
☆19 · Feb 24, 2026 · Updated last week
weishengying / cutlass_flash_atten_fp8
An FP8 flash attention implementation for the Ada architecture, built with the CUTLASS repository.
☆79 · Aug 12, 2024 · Updated last year
zhuzilin / flash-attention-with-sink
☆38 · Aug 7, 2025 · Updated 6 months ago
TiledTensor / TiledKernel
TiledKernel is a code generation library based on macro kernels and memory hierarchy graph data structure.
☆19 · May 12, 2024 · Updated last year
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆44 · Feb 27, 2025 · Updated last year
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆102 · Dec 17, 2024 · Updated last year
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆97 · Sep 19, 2025 · Updated 5 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆106 · Jun 28, 2025 · Updated 8 months ago
Chtholly-Boss / swizzle
A practical way of learning Swizzle
☆37 · Feb 3, 2025 · Updated last year
lixiuhong / batched_gemm
☆40 · Feb 28, 2020 · Updated 6 years ago
zeroine / cutlass-cute-sample
☆49 · Apr 15, 2024 · Updated last year
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆168 · Nov 11, 2025 · Updated 3 months ago
ConvolutedDog / gpgpu-sim-comments
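GPGPU-Sim code with Chinese annotations: the latest GPGPU-Sim simulator source, commented in Chinese to help Chinese-speaking users better understand and use the simulator.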
☆28 · Dec 18, 2024 · Updated last year
weishengying / tiny-flash-attention
A stripped-down flash-attention implemented with CUTLASS, intended for teaching.
☆58 · Aug 12, 2024 · Updated last year
simveit / persistent_dense_gemm
Persistent dense gemm for Hopper in `CuTeDSL`
☆15 · Aug 9, 2025 · Updated 6 months ago
yu-yake2002 / ysyx-docker
A docker image for One Student One Chip's debug exam
☆10 · Sep 22, 2023 · Updated 2 years ago
KuangjuX / cu-x
🎉My Collections of CUDA Kernels~
☆11 · Jun 25, 2024 · Updated last year
reed-lau / cute-gemm
☆168 · Feb 5, 2026 · Updated last month
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆409 · Updated this week
habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to…
☆30 · Jan 28, 2026 · Updated last month
JJXiangJiaoJun / cutlass_gemv
GEMV implementation with CUTLASS
☆19 · Aug 21, 2025 · Updated 6 months ago
eth-cscs / Tiled-MM
Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.
☆32 · Apr 2, 2025 · Updated 11 months ago
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆60 · Mar 25, 2025 · Updated 11 months ago
maxiaosong1124 / ncu-cuda-profiling-skill
Let coding agents use ncu skills to analyze CUDA programs automatically!
☆47 · Feb 5, 2026 · Updated last month
YashasSamaga / ConvolutionBuildingBlocks
GEMM and Winograd based convolutions using CUTLASS
☆28 · Jul 15, 2020 · Updated 5 years ago
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆165 · Feb 11, 2026 · Updated 3 weeks ago
InternLM / turbomind
☆97 · Mar 26, 2025 · Updated 11 months ago
ByteDance-Seed / cudaLLM
☆134 · Aug 18, 2025 · Updated 6 months ago
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆269 · Jul 1, 2025 · Updated 8 months ago
Oyami-Srk / RISCV-GDB-Paging
Paging Debug tool for GDB using python
☆13 · Jun 4, 2022 · Updated 3 years ago
AnonymousYWL / MYLIB
☆18 · Apr 8, 2022 · Updated 3 years ago
temporal-hpc / reduction-tensor-cores
Fast GPU based tensor core reductions
☆13 · Jan 13, 2023 · Updated 3 years ago
sgl-project / DeepGEMM
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
☆21 · Feb 9, 2026 · Updated 3 weeks ago
dame-cell / Triformer
Transformers components but in Triton
☆34 · May 9, 2025 · Updated 9 months ago