andylolu2 / cuda-mnist
Training MLP on MNIST in 1.5 seconds with pure CUDA
☆46 · Updated 10 months ago
Alternatives and similar repositories for cuda-mnist
Users interested in cuda-mnist are comparing it to the libraries listed below.
- An implementation of the transformer architecture in an Nvidia CUDA kernel · ☆190 · Updated 2 years ago
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI · ☆142 · Updated last year
- A stand-alone implementation of several NumPy dtype extensions used in machine learning · ☆299 · Updated last week
- ☆172 · Updated last year
- High-performance SGEMM on CUDA devices · ☆103 · Updated 8 months ago
- A plugin for Jupyter Notebook to run CUDA C/C++ code · ☆244 · Updated last year
- Custom kernels in the Triton language for accelerating LLMs · ☆25 · Updated last year
- Documented and unit-tested educational deep learning framework with autograd from scratch · ☆120 · Updated last year
- Learn CUDA with PyTorch · ☆84 · Updated this week
- Multi-threaded FP32 matrix multiplication on x86 CPUs · ☆356 · Updated 5 months ago
- An open-source efficient deep learning framework/compiler, written in Python · ☆728 · Updated 3 weeks ago
- NVIDIA tools guide · ☆142 · Updated 8 months ago
- The simplest but fast implementation of matrix multiplication in CUDA · ☆39 · Updated last year
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton · ☆576 · Updated last month
- ☆240 · Updated this week
- LLM training in simple, raw C/CUDA · ☆104 · Updated last year
- Official problem sets / reference kernels for the GPU MODE leaderboard · ☆96 · Updated 2 weeks ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand · ☆194 · Updated 3 months ago
- Accelerated general (FP32) matrix multiplication from scratch in CUDA · ☆139 · Updated 8 months ago
- Slides, notes, and materials for the workshop · ☆331 · Updated last year
- Fastest kernels written from scratch · ☆355 · Updated last week
- NVIDIA-curated collection of educational resources related to general-purpose GPU programming · ☆716 · Updated this week
- Fast low-bit matmul kernels in Triton · ☆373 · Updated this week
- Simple MPI implementation for prototyping or learning · ☆279 · Updated last month
- Cataloging released Triton kernels · ☆260 · Updated 2 weeks ago
- Learning about CUDA by writing PTX code · ☆135 · Updated last year
- Alex Krizhevsky's original code from Google Code · ☆198 · Updated 9 years ago
- Experiment of using Tangent to autodiff Triton · ☆81 · Updated last year
- UNet diffusion model in pure CUDA · ☆647 · Updated last year
- MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvement… · ☆398 · Updated this week