andylolu2 / cuda-mnist
Training MLP on MNIST in 1.5 seconds with pure CUDA
☆46 · Updated last year
Alternatives and similar repositories for cuda-mnist
Users who are interested in cuda-mnist are comparing it to the repositories listed below.
- Documented and Unit Tested educational Deep Learning framework with Autograd from scratch. ☆122 · Updated last year
- A stand-alone implementation of several NumPy dtype extensions used in machine learning. ☆327 · Updated 3 weeks ago
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆155 · Updated 2 years ago
- A plugin for Jupyter Notebook to run CUDA C/C++ code ☆257 · Updated last year
- ☆178 · Updated last year
- An implementation of the transformer architecture onto an Nvidia CUDA kernel ☆202 · Updated 2 years ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆68 · Updated last week
- Neural network from scratch in CUDA/C++ ☆88 · Updated 4 months ago
- Nvidia contributed CUDA tutorial for Numba ☆265 · Updated 3 years ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆197 · Updated 8 months ago
- ML/DL Math and Method notes ☆66 · Updated 2 years ago
- High-Performance FP32 GEMM on CUDA devices ☆117 · Updated last year
- LLM training in simple, raw C/CUDA ☆112 · Updated last year
- Custom kernels in Triton language for accelerating LLMs ☆27 · Updated last year
- Learn CUDA with PyTorch ☆185 · Updated last week
- The simplest but fast implementation of matrix multiplication in CUDA. ☆39 · Updated last year
- Multi-Threaded FP32 Matrix Multiplication on x86 CPUs ☆376 · Updated 9 months ago
- Learning about CUDA by writing PTX code. ☆151 · Updated last year
- Notebooks for the "Deep Learning with JAX" book ☆167 · Updated 7 months ago
- Accelerated General (FP32) Matrix Multiplication from scratch in CUDA ☆181 · Updated last year
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆200 · Updated this week
- TorchFix - a linter for PyTorch-using code with autofix support ☆152 · Updated 5 months ago
- Experiment of using Tangent to autodiff triton ☆82 · Updated 2 years ago
- Tutorials for Triton, a language for writing gpu kernels ☆72 · Updated 2 years ago
- FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores ☆340 · Updated last year
- Simple MPI implementation for prototyping or learning ☆300 · Updated 5 months ago
- Slides, notes, and materials for the workshop ☆339 · Updated last year
- Puzzles for exploring transformers ☆384 · Updated 2 years ago
- MoE training for Me and You and maybe other people ☆331 · Updated 3 weeks ago
- MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvement… ☆406 · Updated this week