debowin / cuda-tiled-2D-convolution
Optimized parallel tiled approach to 2D convolution, taking advantage of the lower-latency, higher-bandwidth shared memory within GPU thread blocks as well as aggressively cached global constant memory.
☆14 · Updated 7 years ago
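A minimal sketch of the tiling strategy the description outlines, not code taken from the repository: each thread block stages an input tile plus its halo into shared memory, while the filter lives in constant memory so reads are served from the constant cache. The kernel name, tile size, and mask width below are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

#define MASK_WIDTH 5                                // assumed filter size
#define TILE_WIDTH 16                               // assumed output tile size
#define BLOCK_WIDTH (TILE_WIDTH + MASK_WIDTH - 1)   // input tile including halo

// Filter coefficients in constant memory: read-only, cached, and broadcast
// when all threads in a warp read the same element.
__constant__ float d_mask[MASK_WIDTH * MASK_WIDTH];

__global__ void tiledConv2D(const float *in, float *out, int height, int width)
{
    __shared__ float tile[BLOCK_WIDTH][BLOCK_WIDTH];

    // Output pixel this thread would produce (only interior threads write).
    int outRow = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int outCol = blockIdx.x * TILE_WIDTH + threadIdx.x;

    // Matching input pixel, shifted back by the halo radius.
    int inRow = outRow - MASK_WIDTH / 2;
    int inCol = outCol - MASK_WIDTH / 2;

    // Stage the input tile into shared memory, zero-padding outside the image.
    tile[threadIdx.y][threadIdx.x] =
        (inRow >= 0 && inRow < height && inCol >= 0 && inCol < width)
            ? in[inRow * width + inCol] : 0.0f;
    __syncthreads();

    // The first TILE_WIDTH x TILE_WIDTH threads each compute one output,
    // reading their MASK_WIDTH x MASK_WIDTH window from shared memory.
    if (threadIdx.y < TILE_WIDTH && threadIdx.x < TILE_WIDTH &&
        outRow < height && outCol < width) {
        float acc = 0.0f;
        for (int i = 0; i < MASK_WIDTH; ++i)
            for (int j = 0; j < MASK_WIDTH; ++j)
                acc += d_mask[i * MASK_WIDTH + j] *
                       tile[threadIdx.y + i][threadIdx.x + j];
        out[outRow * width + outCol] = acc;
    }
}
```

On the host side, the filter would be copied once with `cudaMemcpyToSymbol(d_mask, h_mask, MASK_WIDTH * MASK_WIDTH * sizeof(float))` and the kernel launched with `dim3 block(BLOCK_WIDTH, BLOCK_WIDTH)` and one block per TILE_WIDTH × TILE_WIDTH output tile.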
Alternatives and similar repositories for cuda-tiled-2D-convolution
Users interested in cuda-tiled-2D-convolution are comparing it to the libraries listed below
- CUDA based GPU Programming ☆34 · Updated last year
- ☆40 · Updated 4 years ago
- Study parallel programming - CUDA, OpenMP, MPI, Pthread ☆57 · Updated 2 years ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆134 · Updated 4 years ago
- Examples showing how to utilize the NVML library for GPU monitoring ☆28 · Updated 3 years ago
- Code samples for the CUDA tutorial "CUDA and Applications to Task-based Programming" ☆88 · Updated last year
- Introduction to CUDA programming ☆118 · Updated 8 years ago
- ☆17 · Updated 3 weeks ago
- Model zoo for the Quantized ONNX (QONNX) model format ☆12 · Updated last week
- CUDA 8-bit Tensor Core Matrix Multiplication based on m16n16k16 WMMA API ☆30 · Updated last year
- This example starts with a simple sum reduction in CUDA, then steps through a series of optimizations we can perform to improve its perfo… (a minimal reduction sketch follows this list) ☆13 · Updated 4 years ago
- A PyTorch implementation of real XNOR-popcount (1-bit op) GEMM Linear extension, supporting both CPU and CUDA ☆21 · Updated 2 years ago
- Serial and parallel implementations of matrix multiplication ☆41 · Updated 4 years ago
- Test suite for probing the numerical behavior of NVIDIA tensor cores ☆38 · Updated 10 months ago
- ☆44 · Updated 4 years ago
- A tool to deploy Deep Neural Networks on PULP-based SoCs ☆80 · Updated 3 months ago
- An extension library of WMMA API (Tensor Core API) ☆97 · Updated 10 months ago
- High-Performance SGEMM on CUDA devices ☆94 · Updated 4 months ago
- QONNX: Arbitrary-Precision Quantized Neural Networks in ONNX ☆149 · Updated last week
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme ☆70 · Updated 2 months ago
- PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware ☆110 · Updated 6 months ago
- ☆45 · Updated 11 months ago
- The official, proof-of-concept C++ implementation of PocketNN ☆33 · Updated 11 months ago
- Awesome Quantization Paper lists with Codes ☆11 · Updated 4 years ago
- Custom BLAS and LAPACK Cross-Compilation Framework for RISC-V ☆19 · Updated 5 years ago
- This repository is a read-only mirror of https://gitlab.arm.com/kleidi/kleidiai ☆44 · Updated this week
- Converting a deep neural network to integer-only inference in native C via uniform quantization and the fixed-point representation ☆25 · Updated 3 years ago
- Examples from Programming in Parallel with CUDA ☆149 · Updated 2 years ago
- This repository targets OpenCL GEMM performance optimization. It compares several libraries: clBLAS, clBLAST, MIOpenGemm, Inte… ☆17 · Updated 6 years ago
- ☆18 · Updated 5 years ago
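For the sum-reduction entry above, a minimal sketch of the kind of baseline kernel such a walkthrough typically starts from, before the optimizations it describes; the kernel name and launch parameters are illustrative assumptions, not code from that repository. Each block performs a tree reduction in shared memory, and the per-block partial sums are reduced in a second pass or on the host.

```cuda
#include <cuda_runtime.h>

// Baseline shared-memory tree reduction: each block collapses blockDim.x
// input elements into one partial sum.
__global__ void blockSumReduce(const float *in, float *partial, int n)
{
    extern __shared__ float sdata[];   // sized at launch: blockDim.x floats

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // One element per thread; out-of-range threads contribute zero.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Halve the number of active threads each step until one value remains.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }

    // Thread 0 holds this block's partial sum.
    if (tid == 0)
        partial[blockIdx.x] = sdata[0];
}
```

A launch would request the dynamic shared memory explicitly, e.g. `blockSumReduce<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_partial, n)`, with `blockSize` a power of two so the halving loop covers every element.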