debowin / cuda-tiled-2D-convolution
Optimized parallel tiled approach to 2D convolution, taking advantage of the lower-latency, higher-bandwidth shared memory within GPU thread blocks as well as aggressively cached global constant memory.
☆14 · Updated 7 years ago
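A minimal sketch of the tiling strategy the description outlines, not code taken from the repository: each thread block stages an input tile plus its halo into shared memory, while the filter lives in constant memory so reads are served from the constant cache. The kernel name, tile size, and mask width below are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

#define MASK_WIDTH 5                                // assumed filter size
#define TILE_WIDTH 16                               // assumed output tile size
#define BLOCK_WIDTH (TILE_WIDTH + MASK_WIDTH - 1)   // input tile including halo

// Filter coefficients in constant memory: read-only, cached, and broadcast
// when all threads in a warp read the same element.
__constant__ float d_mask[MASK_WIDTH * MASK_WIDTH];

__global__ void tiledConv2D(const float *in, float *out, int height, int width)
{
    __shared__ float tile[BLOCK_WIDTH][BLOCK_WIDTH];

    // Output pixel this thread would produce (only interior threads write).
    int outRow = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int outCol = blockIdx.x * TILE_WIDTH + threadIdx.x;

    // Matching input pixel, shifted back by the halo radius.
    int inRow = outRow - MASK_WIDTH / 2;
    int inCol = outCol - MASK_WIDTH / 2;

    // Stage the input tile into shared memory, zero-padding outside the image.
    tile[threadIdx.y][threadIdx.x] =
        (inRow >= 0 && inRow < height && inCol >= 0 && inCol < width)
            ? in[inRow * width + inCol] : 0.0f;
    __syncthreads();

    // The first TILE_WIDTH x TILE_WIDTH threads each compute one output,
    // reading their MASK_WIDTH x MASK_WIDTH window from shared memory.
    if (threadIdx.y < TILE_WIDTH && threadIdx.x < TILE_WIDTH &&
        outRow < height && outCol < width) {
        float acc = 0.0f;
        for (int i = 0; i < MASK_WIDTH; ++i)
            for (int j = 0; j < MASK_WIDTH; ++j)
                acc += d_mask[i * MASK_WIDTH + j] *
                       tile[threadIdx.y + i][threadIdx.x + j];
        out[outRow * width + outCol] = acc;
    }
}
```

On the host side, the filter would be copied once with `cudaMemcpyToSymbol(d_mask, h_mask, MASK_WIDTH * MASK_WIDTH * sizeof(float))` and the kernel launched with `dim3 block(BLOCK_WIDTH, BLOCK_WIDTH)` and one block per TILE_WIDTH × TILE_WIDTH output tile.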
Alternatives and similar repositories for cuda-tiled-2D-convolution
Users interested in cuda-tiled-2D-convolution are comparing it to the libraries listed below
- CUDA based GPU Programming ☆34 · Updated last year
- ☆40 · Updated 4 years ago
- Study parallel programming - CUDA, OpenMP, MPI, Pthread ☆57 · Updated 2 years ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆134 · Updated 4 years ago
- Examples showing how to utilize the NVML library for GPU monitoring ☆28 · Updated 3 years ago
- Code samples for the CUDA tutorial "CUDA and Applications to Task-based Programming" ☆88 · Updated last year
- Introduction to CUDA programming ☆118 · Updated 8 years ago
- ☆17 · Updated 3 weeks ago
- Model zoo for the Quantized ONNX (QONNX) model format ☆12 · Updated last week
- CUDA 8-bit Tensor Core Matrix Multiplication based on m16n16k16 WMMA API ☆30 · Updated last year
- This example starts with a simple sum reduction in CUDA, then steps through a series of optimizations we can perform to improve its perfo… (a minimal reduction sketch follows this list) ☆13 · Updated 4 years ago
- A PyTorch implementation of real XNOR-popcount (1-bit op) GEMM Linear extension, supporting both CPU and CUDA ☆21 · Updated 2 years ago
- Serial and parallel implementations of matrix multiplication ☆41 · Updated 4 years ago
- Test suite for probing the numerical behavior of NVIDIA tensor cores ☆38 · Updated 10 months ago
- ☆44 · Updated 4 years ago
- A tool to deploy Deep Neural Networks on PULP-based SoCs ☆80 · Updated 3 months ago
- An extension library of WMMA API (Tensor Core API) ☆97 · Updated 10 months ago
- High-Performance SGEMM on CUDA devices ☆94 · Updated 4 months ago
- QONNX: Arbitrary-Precision Quantized Neural Networks in ONNX ☆149 · Updated last week
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme ☆70 · Updated 2 months ago
- PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware ☆110 · Updated 6 months ago
- ☆45 · Updated 11 months ago
- The official, proof-of-concept C++ implementation of PocketNN ☆33 · Updated 11 months ago
- Awesome Quantization Paper lists with Codes ☆11 · Updated 4 years ago
- Custom BLAS and LAPACK Cross-Compilation Framework for RISC-V ☆19 · Updated 5 years ago
- This repository is a read-only mirror of https://gitlab.arm.com/kleidi/kleidiai ☆44 · Updated this week
- Converting a deep neural network to integer-only inference in native C via uniform quantization and the fixed-point representation ☆25 · Updated 3 years ago
- Examples from Programming in Parallel with CUDA ☆149 · Updated 2 years ago
- This repository targets OpenCL GEMM performance optimization. It compares several libraries: clBLAS, clBLAST, MIOpenGemm, Inte… ☆17 · Updated 6 years ago
- ☆18 · Updated 5 years ago
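For the sum-reduction entry above, a minimal sketch of the kind of baseline kernel such a walkthrough typically starts from, before the optimizations it describes; the kernel name and launch parameters are illustrative assumptions, not code from that repository. Each block performs a tree reduction in shared memory, and the per-block partial sums are reduced in a second pass or on the host.

```cuda
#include <cuda_runtime.h>

// Baseline shared-memory tree reduction: each block collapses blockDim.x
// input elements into one partial sum.
__global__ void blockSumReduce(const float *in, float *partial, int n)
{
    extern __shared__ float sdata[];   // sized at launch: blockDim.x floats

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // One element per thread; out-of-range threads contribute zero.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Halve the number of active threads each step until one value remains.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }

    // Thread 0 holds this block's partial sum.
    if (tid == 0)
        partial[blockIdx.x] = sdata[0];
}
```

A launch would request the dynamic shared memory explicitly, e.g. `blockSumReduce<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_partial, n)`, with `blockSize` a power of two so the halving loop covers every element.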