te42kyfo / gpu-benches

collection of benchmarks to measure basic GPU capabilities

☆265

Related projects

Alternatives and complementary repositories for gpu-benches

daadaada / turingas
Assembler for NVIDIA Volta and Turing GPUs
☆201 · Updated 2 years ago
sunlex0717 / DissectingTensorCores
☆80 · Updated 7 months ago
Bruce-Lee-LY / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆302 · Updated 2 months ago
leimao / CUDA-GEMM-Optimization
CUDA Matrix Multiplication Optimization
☆141 · Updated 4 months ago
microsoft / triton-shared
Shared Middle-Layer for Triton Compilation
☆191 · Updated this week
yzhaiustc / Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.
☆280 · Updated 2 years ago
sjfeng1999 / gpu-arch-microbenchmark
Dissecting NVIDIA GPU Architecture
☆82 · Updated 2 years ago
ROCm / composable_kernel
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
☆313 · Updated this week
microsoft / mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
☆250 · Updated this week
reed-lau / cute-gemm
☆79 · Updated 8 months ago
wmmae / wmma_extension
An extension library of WMMA API (Tensor Core API)
☆84 · Updated 4 months ago
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆290 · Updated 2 months ago
Yinghan-Li / YHs_Sample
Yinghan's Code Sample
☆289 · Updated 2 years ago
cloudcores / CuAssembler
An unofficial cuda assembler, for all generations of SASS, hopefully ：）
☆405 · Updated last year
NVIDIA / TensorRT-Incubator
Experimental projects related to TensorRT
☆81 · Updated this week
Cjkkkk / CUDA_gemm
A simple high performance CUDA GEMM implementation.
☆335 · Updated 10 months ago
intel / intel-xpu-backend-for-triton
OpenAI Triton backend for Intel® GPUs
☆143 · Updated this week
nicolaswilde / cuda-tensorcore-hgemm
☆110 · Updated 2 years ago
ColfaxResearch / cutlass-kernels
☆167 · Updated 4 months ago
wzsh / wmma_tensorcore_sample
Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)
☆115 · Updated 4 years ago
NVlabs / NVBit
☆224 · Updated 2 months ago
NVIDIA / nvbandwidth
A tool for bandwidth measurements on NVIDIA GPUs.
☆321 · Updated last month
intel / xetla
☆59 · Updated this week
NVIDIA / nvbench
CUDA Kernel Benchmarking Library
☆519 · Updated this week
OpenPPL / ppl.llm.kernel.cuda
☆138 · Updated 2 weeks ago
bytedance / flux
A fast communication-overlapping library for tensor parallelism on GPUs.
☆224 · Updated 3 weeks ago
wangzyon / NVIDIA_SGEMM_PRACTICE
Step-by-step optimization of CUDA SGEMM
☆240 · Updated 2 years ago
ColfaxResearch / cfx-article-src
☆48 · Updated this week
ROCm / rocMLIR
☆128 · Updated this week
ROCm / Tensile
Stretching GPU performance for GEMMs and tensor contractions.
☆223 · Updated this week