OrangeOwlSolutions / General-CUDA-programming
☆44 · Updated 7 years ago
Alternatives and similar repositories for General-CUDA-programming
Users who are interested in General-CUDA-programming are comparing it to the repositories listed below
- Some CUDA design patterns and a bit of template magic for CUDA ☆151 · Updated last year
- Example of how to use CUDA with CMake >= 3.8 ☆69 · Updated last year
- CNNs in Halide ☆23 · Updated 9 years ago
- Fast integer division with divisor not known at compile time. To be used primarily in CUDA kernels. ☆70 · Updated 9 years ago
- ☆22 · Updated 7 years ago
- Example code used in the CVPR 2015 tutorial ☆40 · Updated 9 years ago
- Learn OpenCL step by step. ☆135 · Updated 2 years ago
- ☆67 · Updated 11 years ago
- This example builds on the parallel-forall repo separate compilation example by adding CMake to it. ☆17 · Updated 7 years ago
- flexible-gemm conv of deepcore ☆17 · Updated 5 years ago
- THIS REPOSITORY HAS MOVED TO github.com/nvidia/cub, WHICH IS AUTOMATICALLY MIRRORED HERE. ☆84 · Updated last year
- Utilities for CUDA programming ☆40 · Updated 5 years ago
- portDNN is a library implementing neural network algorithms written using SYCL ☆113 · Updated 11 months ago
- Simple example of implementing a new Tensorflow operation and its gradient in C++. ☆56 · Updated 6 years ago
- Collection of CUDA benchmarks, with a focus on unified vs. explicit memory management. ☆20 · Updated 5 years ago
- Connected Component Labeling. ☆43 · Updated 4 years ago
- Algorithms implemented in CUDA + resources about GPGPU ☆56 · Updated 3 years ago
- BGHT: High-performance static GPU hash tables. ☆63 · Updated last month
- ☆20 · Updated 6 years ago
- FastHOG library that has been fixed to work with CUDA 5.x on Ubuntu 12.04 ☆20 · Updated 11 years ago
- study of cutlass ☆21 · Updated 6 months ago
- This is a tuned sparse matrix dense vector multiplication (SpMV) library ☆21 · Updated 9 years ago
- Source code examples from the Parallel Forall Blog ☆96 · Updated 6 years ago
- How to use CUDA with Python numpy ☆38 · Updated 7 years ago
- CUDA C++ syntax support & snippets for VSCode ☆20 · Updated 4 years ago
- CUDA implementation of the fundamental sum reduce operation. Aims to be as optimized as reasonable. ☆37 · Updated 7 years ago
- Efficient CUDA Stream Compaction Library ☆33 · Updated last year
- implementation of winograd minimal convolution algorithm on Intel Architecture ☆39 · Updated 7 years ago
- Codebase associated with the PyTorch compiler tutorial ☆45 · Updated 5 years ago
- CUDA Sparse-Matrix Vector Multiplication, using Sliced Coordinate format ☆21 · Updated 6 years ago