BobMcDear / neural-network-cuda
Neural network from scratch in CUDA/C++
☆78 Updated 2 months ago
Alternatives and similar repositories for neural-network-cuda:
Users who are interested in neural-network-cuda are comparing it to the repositories listed below
- Implement Neural Networks in Cuda from Scratch ☆22 Updated 10 months ago
- An implementation of the transformer architecture onto an Nvidia CUDA kernel ☆174 Updated last year
- CUDA Matrix Multiplication Optimization ☆173 Updated 8 months ago
- LLM training in simple, raw C/CUDA ☆92 Updated 10 months ago
- ☆28 Updated 2 months ago
- NVIDIA tools guide ☆119 Updated 2 months ago
- High-Performance SGEMM on CUDA devices ☆87 Updated 2 months ago
- Fastest kernels written from scratch ☆202 Updated 3 weeks ago
- Customized matrix multiplication kernels ☆53 Updated 3 years ago
- A parallel framework for training deep neural networks ☆57 Updated last week
- ☆152 Updated last year
- CUDA Learning guide ☆349 Updated 9 months ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆127 Updated 4 years ago
- Collection of kernels written in Triton language ☆114 Updated last month
- PyTorch implementation of EfficientNet ☆10 Updated 2 years ago
- A plugin for Jupyter Notebook to run CUDA C/C++ code ☆217 Updated 6 months ago
- The simplest but fast implementation of matrix multiplication in CUDA. ☆34 Updated 8 months ago
- ☆82 Updated last week
- ☆73 Updated 4 months ago
- Cataloging released Triton kernels. ☆208 Updated 2 months ago
- Simple neural network implementation using CUDA technology. It is an educational implementation. ☆96 Updated 6 years ago
- Some CUDA example code with READMEs. ☆93 Updated 3 weeks ago
- ☆192 Updated this week
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆153 Updated this week
- Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T… ☆66 Updated 4 years ago
- Step-by-step optimization of CUDA SGEMM ☆294 Updated 2 years ago
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆127 Updated last year
- Fast low-bit matmul kernels in Triton ☆272 Updated this week
- A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser") ☆314 Updated this week
- PyTorch implementation of the vision transformer ☆18 Updated last year