paramhanji / CUDA-CNN
Implementation of a simple CNN using CUDA
☆68 · Updated 8 years ago
Alternatives and similar repositories for CUDA-CNN
Users who are interested in CUDA-CNN are comparing it to the libraries listed below.
- Fast CUDA Kernels for ResNet Inference. ☆174 · Updated 5 years ago
- CUDA for MNIST training/inference ☆40 · Updated last year
- cuDNN sample codes provided by Nvidia ☆45 · Updated 6 years ago
- how to design cpu gemm on x86 with avx256, that can beat openblas. ☆70 · Updated 6 years ago
- [MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration ☆199 · Updated 3 years ago
- ☆111 · Updated last year
- Automatic Schedule Exploration and Optimization Framework for Tensor Computations ☆176 · Updated 3 years ago
- Subpart source code of deepcore v0.7 ☆27 · Updated 4 years ago
- play gemm with tvm ☆91 · Updated last year
- implementation of winograd minimal convolution algorithm on Intel Architecture ☆39 · Updated 7 years ago
- ☆96 · Updated 3 years ago
- ☆38 · Updated 3 years ago
- Inference of quantization aware trained networks using TensorRT ☆80 · Updated 2 years ago
- ☆18 · Updated 4 years ago
- A Winograd Minimal Filter Implementation in CUDA ☆24 · Updated 3 years ago
- CUDA Matrix Multiplication Optimization ☆186 · Updated 9 months ago
- ☆38 · Updated 5 years ago
- PyTorch -> ONNX -> TVM for autotuning ☆24 · Updated 5 years ago
- Manually implemented quantization-aware training ☆21 · Updated 2 years ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆131 · Updated 4 years ago
- symmetric int8 gemm ☆66 · Updated 4 years ago
- GEMM and Winograd based convolutions using CUTLASS ☆26 · Updated 4 years ago
- Dissecting NVIDIA GPU Architecture ☆94 · Updated 2 years ago
- CUDA Templates for Linear Algebra Subroutines ☆99 · Updated last year
- Chinese translation of the CUDA PTX-ISA Document ☆39 · Updated 2 months ago
- Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core. ☆61 · Updated 8 months ago
- code reading for tvm ☆76 · Updated 3 years ago
- ResNet Implementation, Training, and Inference Using LibTorch C++ API ☆40 · Updated 11 months ago
- Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA ☆32 · Updated 4 years ago
- An unofficial cuda assembler, for all generations of SASS, hopefully :) ☆83 · Updated 2 years ago