BobMcDear / neural-network-cuda
Neural network from scratch in CUDA/C++
☆68 (updated last year)
Related projects
Alternatives and complementary repositories for neural-network-cuda
- Simple, educational neural network implementation using CUDA (☆93, updated 6 years ago)
- Implementing neural networks in CUDA from scratch (☆22, updated 5 months ago)
- CUDA matrix multiplication optimization (☆139, updated 3 months ago)
- NVIDIA tools guide (☆71, updated 2 months ago)
- LLM training in simple, raw C/CUDA (☆86, updated 6 months ago)
- A set of hands-on tutorials for CUDA programming (☆193, updated 7 months ago)
- A plugin for Jupyter Notebook to run CUDA C/C++ code (☆200, updated last month)
- Customized matrix multiplication kernels (☆53, updated 2 years ago)
- Memory Optimizations for Deep Learning (ICML 2023) (☆59, updated 7 months ago)
- Learning about CUDA by writing PTX code (☆28, updated 8 months ago)
- Fast CUDA matrix multiplication from scratch (☆471, updated 10 months ago)
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI (☆95, updated last year)
- An implementation of the transformer architecture as an NVIDIA CUDA kernel (☆157, updated last year)
- Serial and parallel implementations of matrix multiplication (☆35, updated 3 years ago)
- Samples demonstrating how to use the Compute Sanitizer tools and public API (☆65, updated last year)
- A fusion code generator for NVIDIA GPUs (commonly known as "nvFuser") (☆268, updated this week)
- Cataloging released Triton kernels (☆132, updated 2 months ago)
- High-speed GEMV kernels, with up to a 2.7x speedup over the PyTorch baseline (☆87, updated 3 months ago)
- A simple but fast implementation of matrix multiplication in CUDA (☆32, updated 3 months ago)
- CUDA learning guide (☆239, updated 4 months ago)
- Examples from Programming in Parallel with CUDA (☆107, updated last year)
- Introduction to CUDA programming (☆113, updated 7 years ago)
- Step-by-step optimization of CUDA SGEMM (☆225, updated 2 years ago)
- Matrix multiply-accumulate with CUDA and WMMA (Tensor Cores) (☆114, updated 4 years ago)
- An experimental CPU backend for Triton (https://github.com/openai/triton) (☆35, updated 5 months ago)