tpoisonooo / how-to-optimize-gemm

row-major matmul optimization

☆701

Alternatives and similar repositories for how-to-optimize-gemm

Users that are interested in how-to-optimize-gemm are comparing it to the libraries listed below

flame / how-to-optimize-gemm
☆1,988 · Jul 29, 2023 · Updated 2 years ago
Cjkkkk / CUDA_gemm
A simple high performance CUDA GEMM implementation.
☆426 · Jan 4, 2024 · Updated 2 years ago
flame / blislab
BLISlab: A Sandbox for Optimizing GEMM
☆555 · Jun 17, 2021 · Updated 4 years ago
Yinghan-Li / YHs_Sample
Yinghan's Code Sample
☆365 · Jul 25, 2022 · Updated 3 years ago
tpoisonooo / chgemm
symmetric int8 gemm
☆67 · Jun 7, 2020 · Updated 5 years ago
MegEngine / MegCC
MegCC is a deep learning model compiler with an ultra-lightweight runtime, high efficiency, and easy portability
☆488 · Oct 23, 2024 · Updated last year
njuhope / cuda_sgemm
☆120 · Apr 11, 2024 · Updated last year
Liu-xiandong / How_to_optimize_in_GPU
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several…
☆1,239 · Jul 29, 2023 · Updated 2 years ago
BBuf / how-to-optim-algorithm-in-cuda
how to optimize some algorithm in cuda.
☆2,819 · Updated this week
pigirons / cpufp
A CPU tool for benchmarking the peak of floating points
☆576 · Feb 7, 2026 · Updated last week
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆410 · Updated this week
yzhaiustc / Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.
☆407 · Jan 2, 2025 · Updated last year
BBuf / how-to-optimize-gemm
☆97 · Aug 8, 2021 · Updated 4 years ago
Bruce-Lee-LY / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆522 · Sep 8, 2024 · Updated last year
OpenPPL / ppl.nn
A primitive library for neural network
☆1,368 · Nov 24, 2024 · Updated last year
BBuf / tvm_mlir_learn
compiler learning resources collect.
☆2,678 · Mar 19, 2025 · Updated 10 months ago
MegEngine / mperf
mperf is an operator performance tuning toolbox for mobile/embedded platforms
☆192 · Aug 17, 2023 · Updated 2 years ago
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆96 · Sep 13, 2025 · Updated 5 months ago
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆114 · Sep 10, 2024 · Updated last year
merrymercy / awesome-tensor-compilers
A list of awesome compiler projects and papers for tensor computation and deep learning.
☆2,731 · Oct 19, 2024 · Updated last year
weishengying / cutlass_flash_atten_fp8
An fp8 flash attention implementation on the Ada architecture, built with the cutlass repository
☆78 · Aug 12, 2024 · Updated last year
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆192 · Jan 28, 2025 · Updated last year
66RING / tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
☆486 · Jan 20, 2026 · Updated 3 weeks ago
AyakaGEMM / Hands-on-GEMM
☆145 · Mar 18, 2024 · Updated last year
ColfaxResearch / cutlass-kernels
☆261 · Jul 11, 2024 · Updated last year
wangzyon / NVIDIA_SGEMM_PRACTICE
Step-by-step optimization of CUDA SGEMM
☆431 · Mar 30, 2022 · Updated 3 years ago
MegEngine / MegPeak
☆256 · Sep 15, 2023 · Updated 2 years ago
OpenPPL / ppl.nn.llm
☆141 · Apr 23, 2024 · Updated last year
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆148 · May 10, 2025 · Updated 9 months ago
OpenPPL / ppq
PPL Quantization Tool (PPQ) is a powerful offline neural network quantization tool.
☆1,781 · Mar 28, 2024 · Updated last year
microsoft / nnfusion
A flexible and efficient deep neural network (DNN) compiler that generates high-performance executable from a DNN model description.
☆1,006 · Sep 19, 2024 · Updated last year
bytedance / ByteTransformer
optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052
☆477 · Mar 15, 2024 · Updated last year
OpenPPL / ppl.llm.kernel.cuda
☆152 · Jan 9, 2025 · Updated last year
cloudcores / CuAssembler
An unofficial cuda assembler, for all generations of SASS, hopefully :)
☆568 · Apr 20, 2023 · Updated 2 years ago
pigirons / conv3x3_m1
This is a demo how to write a high performance convolution run on apple silicon
☆57 · Feb 8, 2022 · Updated 4 years ago
NVIDIA / cutlass
CUDA Templates and Python DSLs for High-Performance Linear Algebra
☆9,266 · Updated this week
BBuf / ArmNeonOptimization
arm-neon
☆92 · Aug 2, 2024 · Updated last year
Ldpe2G / ArmNeonOptimization
Arm neon optimization practice
☆394 · Dec 22, 2020 · Updated 5 years ago
daquexian / faster-rwkv
☆125 · Dec 15, 2023 · Updated 2 years ago
