KernelTuner / kernel_tuner
Kernel Tuner
☆287 · Updated this week
Related projects
Alternatives and complementary repositories for kernel_tuner
- CUDA Kernel Benchmarking Library ☆519 · Updated this week
- An implementation of BLAS using the SYCL open standard. ☆259 · Updated 3 weeks ago
- Examples demonstrating available options to program multiple GPUs in a single node or a cluster ☆561 · Updated 3 weeks ago
- STREAM, for lots of devices written in many programming models ☆326 · Updated 2 months ago
- ☆486 · Updated this week
- Assembler for NVIDIA Volta and Turing GPUs ☆202 · Updated 2 years ago
- collection of benchmarks to measure basic GPU capabilities ☆264 · Updated 5 months ago
- Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm ☆196 · Updated 2 weeks ago
- Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators ☆315 · Updated this week
- DaCe - Data Centric Parallel Programming ☆499 · Updated this week
- Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial ☆190 · Updated this week
- Examples for HIP ☆200 · Updated 2 weeks ago
- A single-header C++ library for simplifying the use of CUDA Runtime Compilation (NVRTC). ☆518 · Updated 6 months ago
- Advanced Profiling and Analytics for AMD Hardware ☆138 · Updated this week
- Stretching GPU performance for GEMMs and tensor contractions. ☆223 · Updated this week
- CUDA kernel author's tools ☆109 · Updated 2 years ago
- ROCm Parallel Primitives ☆162 · Updated this week
- A GPU benchmark tool for evaluating GPUs and CPUs on mixed operational intensity kernels (CUDA, OpenCL, HIP, SYCL, OpenMP) ☆363 · Updated 3 months ago
- ROCm Communication Collectives Library (RCCL) ☆270 · Updated this week
- CLTune: An automatic OpenCL & CUDA kernel tuner ☆170 · Updated last year
- The Foundation for All Legate Libraries ☆193 · Updated last week
- ☆59 · Updated this week
- MatMul Performance Benchmarks for a Single CPU Core comparing both hand engineered and codegen kernels. ☆124 · Updated last year
- Next generation BLAS implementation for ROCm platform ☆346 · Updated this week
- ☆224 · Updated 2 months ago
- A code generator for array-based code on CPUs and GPUs ☆589 · Updated this week
- Instructions, Docker images, and examples for Nsight Compute and Nsight Systems ☆128 · Updated 4 years ago
- An extension library of WMMA API (Tensor Core API) ☆84 · Updated 4 months ago
- A Library for fast Hash Tables on GPUs ☆109 · Updated 2 years ago
- GPUOcelot: A dynamic compilation framework for PTX ☆147 · Updated 2 months ago