njuhope/cuda_sgemm

☆121

Alternatives and similar repositories for cuda_sgemm

Users that are interested in cuda_sgemm are comparing it to the libraries listed below.

Yinghan-Li / YHs_Sample
Yinghan's Code Sample
☆365 · Jul 25, 2022 · Updated 4 years ago
tpoisonooo / how-to-optimize-gemm
row-major matmul optimization
☆744 · May 14, 2026 · Updated 2 months ago
nicolaswilde / cuda-tensorcore-hgemm
☆160 · Dec 26, 2024 · Updated last year
yzhaiustc / Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.
☆420 · Jan 2, 2025 · Updated last year
lixiuhong / implicit_gemm_convolution
☆14 · May 28, 2019 · Updated 7 years ago
Cjkkkk / CUDA_gemm
A simple high performance CUDA GEMM implementation.
☆437 · Jan 4, 2024 · Updated 2 years ago
AyakaGEMM / Hands-on-GEMM
☆156 · Mar 18, 2024 · Updated 2 years ago
Liu-xiandong / How_to_optimize_in_GPU
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several…
☆1,335 · Jul 29, 2023 · Updated 3 years ago
MegEngine / cutlass
CUDA Templates for Linear Algebra Subroutines
☆102 · Apr 25, 2024 · Updated 2 years ago
temporal-hpc / reduction-tensor-cores
Fast GPU based tensor core reductions
☆12 · Jan 13, 2023 · Updated 3 years ago
nicolaswilde / cuda-sgemm
☆73 · Jan 6, 2025 · Updated last year
Bruce-Lee-LY / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆558 · Sep 8, 2024 · Updated last year
CSshengxy / MEC
ICML2017 MEC: Memory-efficient Convolution for Deep Neural Network, C++ implementation (unofficial)
☆17 · Apr 9, 2019 · Updated 7 years ago
wzsh / wmma_tensorcore_sample
Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)
☆147 · Aug 18, 2020 · Updated 5 years ago
BBuf / how-to-optim-algorithm-in-cuda
how to optimize some algorithm in cuda.
☆3,152 · Updated this week
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆192 · Jan 28, 2025 · Updated last year
jundaf2 / CUDA-INT8-GEMM
CUDA 8-bit Tensor Core Matrix Multiplication based on m16n16k16 WMMA API
☆37 · Sep 15, 2023 · Updated 2 years ago
cloudcores / CuAssembler
An unofficial cuda assembler, for all generations of SASS, hopefully :)
☆609 · Apr 20, 2023 · Updated 3 years ago
flame / how-to-optimize-gemm
☆2,025 · Jul 29, 2023 · Updated 3 years ago
wangzyon / NVIDIA_SGEMM_PRACTICE
Step-by-step optimization of CUDA SGEMM
☆486 · Mar 30, 2022 · Updated 4 years ago
KnowingNothing / MatmulTutorial
A Easy-to-understand TensorOp Matmul Tutorial
☆446 · Mar 5, 2026 · Updated 4 months ago
mrzhuzhe / riven
CPU Memory Compiler and Parallel programing
☆26 · Nov 18, 2024 · Updated last year
flame / blislab
BLISlab: A Sandbox for Optimizing GEMM
☆572 · Jun 17, 2021 · Updated 5 years ago
leimao / CUTLASS-Examples
CUTLASS and CuTe Examples
☆137 · Nov 30, 2025 · Updated 7 months ago
LeiWang1999 / tvm_gpu_gemm
play gemm with tvm
☆91 · Jul 22, 2023 · Updated 3 years ago
leimao / CUDA-GEMM-Optimization
CUDA Matrix Multiplication Optimization
☆277 · Jul 19, 2024 · Updated 2 years ago
66RING / tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
☆528 · Jan 20, 2026 · Updated 6 months ago
reed-lau / cute-gemm
☆189 · May 11, 2026 · Updated 2 months ago
shixun404 / Fault-Tolerant-SGEMM-on-NVIDIA-GPUs
Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs
☆14 · Apr 3, 2025 · Updated last year
tempdban / docs
☆12 · Apr 27, 2013 · Updated 13 years ago
Bruce-Lee-LY / cutlass_gemm
Multiple GEMM operators are constructed with cutlass to support LLM inference.
☆20 · Aug 3, 2025 · Updated 11 months ago
daadaada / turingas
Assembler for NVIDIA Volta and Turing GPUs
☆246 · Jan 13, 2022 · Updated 4 years ago
violetDelia / LLCompiler
☆25 · Jun 11, 2025 · Updated last year
NVIDIA / online-softmax
Benchmark code for the "Online normalizer calculation for softmax" paper
☆110 · Jul 27, 2018 · Updated 8 years ago
UDC-GAC / openCNN
A Winograd Minimal Filter Implementation in CUDA
☆31 · Aug 25, 2021 · Updated 4 years ago
Bruce-Lee-LY / memory_pool
Simple and efficient memory pool is implemented with C++11.
☆10 · Jun 2, 2022 · Updated 4 years ago
wmmae / wmma_extension
An extension library of WMMA API (Tensor Core API)
☆115 · Jul 12, 2024 · Updated 2 years ago
mlbench / mlbench-benchmarks
Distributed ML Training Benchmarks
☆27 · Mar 1, 2023 · Updated 3 years ago
siboehm / SGEMM_CUDA
Fast CUDA matrix multiplication from scratch
☆1,265 · Sep 2, 2025 · Updated 10 months ago