cmpute / pytorch-cmake-example
Example to build PyTorch CUDA extension using CMake (with pybind11 and scikit-build)
☆12Updated 5 years ago
Alternatives and similar repositories for pytorch-cmake-example
Users who are interested in pytorch-cmake-example are comparing it to the libraries listed below.
- ☆33Updated 4 years ago
- A library of GPU kernels for sparse matrix operations.☆280Updated 5 years ago
- Kernel Tuner☆377Updated last week
- CUDA Matrix Multiplication Optimization☆247Updated last year
- Efficient SpGEMM on GPU using CUDA and CSR☆59Updated 2 years ago
- Step-by-step optimization of CUDA SGEMM☆416Updated 3 years ago
- The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resou…☆488Updated last week
- An extension library of WMMA API (Tensor Core API)☆109Updated last year
- Helpful kernel tutorials and examples for tile-based GPU programming☆456Updated last week
- CUTLASS and CuTe Examples☆114Updated 3 weeks ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)☆146Updated 5 years ago
- A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")☆367Updated last week
- ☆254Updated last year
- ☆606Updated last week
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.☆691Updated last week
- Code samples for the CUDA tutorial "CUDA and Applications to Task-based Programming"☆94Updated 2 years ago
- CUDA Kernel Benchmarking Library☆782Updated 2 weeks ago
- Training material for Nsight developer tools☆173Updated last year
- Samples demonstrating how to use the Compute Sanitizer Tools and Public API☆91Updated 2 years ago
- ☆186Updated last year
- ☆127Updated 2 months ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …☆191Updated 10 months ago
- Examples demonstrating available options to program multiple GPUs in a single node or a cluster☆843Updated 3 months ago
- Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial☆341Updated 3 weeks ago
- A simple high performance CUDA GEMM implementation.☆421Updated last year
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆104Updated 5 months ago
- Fastest kernels written from scratch☆501Updated 3 months ago
- ☆42Updated 4 years ago
- Online CUDA Occupancy Calculator☆81Updated 4 years ago
- A plugin for Jupyter Notebook to run CUDA C/C++ code☆257Updated last year