debowin / cuda-tiled-2D-convolution
Optimized parallel tiled approach to 2D convolution that takes advantage of lower-latency, higher-bandwidth shared memory as well as global constant memory cached aggressively within GPU thread blocks.
☆14Updated 8 years ago
Alternatives and similar repositories for cuda-tiled-2D-convolution
Users who are interested in cuda-tiled-2D-convolution are comparing it to the libraries listed below
- Code samples for the CUDA tutorial "CUDA and Applications to Task-based Programming"☆95Updated 2 years ago
- ☆43Updated 4 years ago
- General Matrix Multiplication using NVIDIA Tensor Cores☆28Updated last year
- A set of hands-on tutorials for CUDA programming☆247Updated last year
- Examples for using SYCL on CUDA☆63Updated 5 months ago
- ☆21Updated 3 weeks ago
- Algorithms implemented in CUDA + resources about GPGPU☆62Updated 4 years ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)☆145Updated 5 years ago
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme☆111Updated 2 months ago
- ☆59Updated 4 months ago
- Introduction to CUDA programming☆129Updated 8 years ago
- ☆49Updated 5 years ago
- ☆16Updated 3 months ago
- Attention in SRAM on Tenstorrent Grayskull☆40Updated last year
- Inline PTX Assembly in CUDA example☆13Updated 3 years ago
- Examples showing how to utilize the NVML library for GPU monitoring☆29Updated 3 years ago
- Matrix Multiplication on GPU using Shared Memory considering Coalescing and Bank Conflicts☆25Updated 3 years ago
- Fast and full-featured Matrix Market I/O library for C++, Python, and R☆86Updated last year
- ☆29Updated 6 years ago
- Some CUDA design patterns and a bit of template magic for CUDA☆158Updated 2 years ago
- ☆38Updated this week
- ☆20Updated 6 years ago
- cuASR: CUDA Algebra for Semirings☆44Updated 3 years ago
- BLAS implementation for Intel FPGA☆78Updated 5 years ago
- This library empowers users to seamlessly port pretrained models and checkpoints on the HuggingFace (HF) hub (developed using HF transfor…☆85Updated this week
- A minimal CMake-based project skeleton for developing a CUDA application☆17Updated 2 years ago
- This tutorial demonstrates how to use CUDA-Aware MPI☆39Updated 2 years ago
- ☆14Updated 11 months ago
- Serial and parallel implementations of matrix multiplication☆45Updated 4 years ago
- GPUOcelot: A dynamic compilation framework for PTX☆219Updated last year