cmpute / pytorch-cmake-example
Example of building a PyTorch CUDA extension using CMake (with pybind11 and scikit-build)
☆11 Updated 4 years ago
Alternatives and similar repositories for pytorch-cmake-example:
Users who are interested in pytorch-cmake-example are comparing it to the libraries listed below:
- ☆29 Updated 3 years ago
- CUDA Matrix Multiplication Optimization ☆161 Updated 7 months ago
- Training material for Nsight developer tools ☆147 Updated 6 months ago
- A library of GPU kernels for sparse matrix operations. ☆254 Updated 4 years ago
- ☆177 Updated this week
- Step-by-step optimization of CUDA SGEMM ☆280 Updated 2 years ago
- ☆43 Updated last month
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆124 Updated 4 years ago
- ☆180 Updated 7 months ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆175 Updated 3 weeks ago
- Fastest kernels written from scratch ☆162 Updated this week
- Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T… ☆63 Updated 4 years ago
- ☆159 Updated 8 months ago
- ☆70 Updated last month
- An extension library of WMMA API (Tensor Core API) ☆88 Updated 7 months ago
- CUTLASS and CuTe Examples ☆38 Updated last month
- extensible collectives library in triton ☆82 Updated 4 months ago
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance. ☆51 Updated last week
- CUDA Kernel Benchmarking Library ☆560 Updated 2 months ago
- Cataloging released Triton kernels. ☆167 Updated last month
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆141 Updated 8 months ago
- Template for starting CUDA/C++ project using CMake with Github Action for CI ☆29 Updated 2 years ago
- An Easy-to-understand TensorOp Matmul Tutorial ☆316 Updated 4 months ago
- ☆67 Updated 3 months ago
- Experimental projects related to TensorRT ☆89 Updated this week
- Applied AI experiments and examples for PyTorch ☆224 Updated this week
- Fast CUDA matrix multiplication from scratch ☆632 Updated last year
- A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores ☆48 Updated last year
- Collection of kernels written in Triton language ☆103 Updated this week