Hardware-Alchemy / cuDNN-sample
cuDNN sample code provided by NVIDIA
☆42 Updated 5 years ago
Related projects:
- Benchmark code for the "Online normalizer calculation for softmax" paper ☆52 Updated 6 years ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆109 Updated 4 years ago
- Instructions, Docker images, and examples for Nsight Compute and Nsight Systems ☆126 Updated 4 years ago
- ☆73 Updated 5 months ago
- CUDA for MNIST training/inference ☆37 Updated 8 months ago
- Dissecting NVIDIA GPU Architecture ☆78 Updated 2 years ago
- A tool for examining GPU scheduling behavior. ☆67 Updated last month
- An extension library of WMMA API (Tensor Core API) ☆81 Updated 2 months ago
- THIS REPOSITORY HAS MOVED TO github.com/nvidia/cub, WHICH IS AUTOMATICALLY MIRRORED HERE. ☆81 Updated 6 months ago
- tophub autotvm log collections ☆70 Updated last year
- Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA ☆31 Updated 4 years ago
- Fast CUDA Kernels for ResNet Inference. ☆164 Updated 5 years ago
- CUDA Matrix Multiplication Optimization ☆118 Updated 2 months ago
- ☆38 Updated 4 years ago
- [MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration ☆191 Updated 2 years ago
- ☆39 Updated 3 years ago
- Python bindings for NVTX ☆66 Updated last year
- TVM stack: exploring the incredible explosion of deep-learning frameworks and how to bring them together ☆63 Updated 6 years ago
- MatMul Performance Benchmarks for a Single CPU Core comparing both hand-engineered and codegen kernels. ☆123 Updated 11 months ago
- Assembler for NVIDIA Volta and Turing GPUs ☆195 Updated 2 years ago
- Training material for Nsight developer tools ☆125 Updated last month
- A Winograd Minimal Filter Implementation in CUDA ☆20 Updated 3 years ago
- Automatic Schedule Exploration and Optimization Framework for Tensor Computations ☆175 Updated 2 years ago
- ☆34 Updated 2 years ago
- ☆34 Updated 3 years ago
- Code for paper "Design Principles for Sparse Matrix Multiplication on the GPU" accepted to Euro-Par 2018 ☆70 Updated 3 years ago
- ☆17 Updated 4 years ago
- Some source code about matrix multiplication implementation on CUDA ☆35 Updated 6 years ago
- Benchmark scripts for TVM ☆73 Updated 2 years ago
- ☆100 Updated 5 months ago