BobMcDear / neural-network-cuda
Neural network from scratch in CUDA/C++
☆69 Updated last year
Related projects
Alternatives and complementary repositories for neural-network-cuda
- Simple neural network implementation using CUDA technology. It is an educational implementation. ☆95 Updated 6 years ago
- NVIDIA tools guide ☆71 Updated 3 months ago
- CUDA Matrix Multiplication Optimization ☆141 Updated 4 months ago
- LLM training in simple, raw C/CUDA ☆87 Updated 6 months ago
- A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser") ☆271 Updated this week
- Implement Neural Networks in Cuda from Scratch ☆22 Updated 6 months ago
- ☆153 Updated this week
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆107 Updated last year
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆117 Updated 4 years ago
- Training material for Nsight developer tools ☆129 Updated 3 months ago
- ☆18 Updated 2 years ago
- Memory Optimizations for Deep Learning (ICML 2023) ☆60 Updated 8 months ago
- ☆55 Updated 6 months ago
- An implementation of the transformer architecture onto an Nvidia CUDA kernel ☆157 Updated last year
- Customized matrix multiplication kernels ☆53 Updated 2 years ago
- Samples demonstrating how to use the Compute Sanitizer Tools and Public API ☆68 Updated last year
- Serial and parallel implementations of matrix multiplication ☆35 Updated 3 years ago
- Examples from Programming in Parallel with CUDA ☆108 Updated last year
- ☆45 Updated 2 weeks ago
- Step-by-step optimization of CUDA SGEMM ☆243 Updated 2 years ago
- ☆133 Updated 9 months ago
- Benchmark code for the "Online normalizer calculation for softmax" paper ☆59 Updated 6 years ago
- Fast CUDA matrix multiplication from scratch ☆482 Updated 10 months ago
- ☆169 Updated 4 months ago
- Cataloging released Triton kernels. ☆138 Updated 2 months ago
- The simplest but fast implementation of matrix multiplication in CUDA. ☆33 Updated 3 months ago
- PyTorch implementation of the vision transformer ☆19 Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆35 Updated 4 months ago
- End to End steps for adding custom ops in PyTorch. ☆19 Updated 4 years ago
- ☆48 Updated 8 months ago