andylolu2 / cuda-mnist
Training MLP on MNIST in 1.5 seconds with pure CUDA
☆46 · Updated 11 months ago
Alternatives and similar repositories for cuda-mnist
Users interested in cuda-mnist are comparing it to the repositories listed below
- An implementation of the transformer architecture as an Nvidia CUDA kernel ☆190 · Updated 2 years ago
- A stand-alone implementation of several NumPy dtype extensions used in machine learning. ☆301 · Updated last week
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ☆145 · Updated 2 years ago
- Documented and unit-tested educational deep learning framework with autograd, written from scratch. ☆122 · Updated last year
- ☆174 · Updated last year
- Learn CUDA with PyTorch ☆87 · Updated 3 weeks ago
- A plugin for Jupyter Notebook to run CUDA C/C++ code ☆246 · Updated last year
- Neural network from scratch in CUDA/C++ ☆86 · Updated last month
- Accelerated General (FP32) Matrix Multiplication from scratch in CUDA ☆161 · Updated 9 months ago
- LLM training in simple, raw C/CUDA ☆105 · Updated last year
- High-Performance SGEMM on CUDA devices ☆107 · Updated 8 months ago
- Multi-Threaded FP32 Matrix Multiplication on x86 CPUs ☆364 · Updated 5 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆58 · Updated this week
- Step-by-step implementation of a fast softmax kernel in CUDA ☆52 · Updated 9 months ago
- An open-source, efficient deep learning framework/compiler, written in Python. ☆731 · Updated last month
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆99 · Updated last week
- Alex Krizhevsky's original code from Google Code ☆199 · Updated 9 years ago
- Qualcomm Cloud AI SDK (Platform and Apps) enables high-performance deep learning inference on Qualcomm Cloud AI platforms, delivering high … ☆67 · Updated 2 months ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆194 · Updated 4 months ago
- The simplest yet fast implementation of matrix multiplication in CUDA (a minimal kernel sketch of this idea follows the list). ☆39 · Updated last year
- Learning about CUDA by writing PTX code. ☆143 · Updated last year
- Competitive GPU kernel optimization platform. ☆107 · Updated last week
- Notebooks for the "Deep Learning with JAX" book ☆156 · Updated 4 months ago
- NVIDIA tools guide ☆143 · Updated 9 months ago
- Custom kernels in Triton language for accelerating LLMs ☆26 · Updated last year
- Slides, notes, and materials for the workshop ☆333 · Updated last year
- Quantized LLM training in pure CUDA/C++. ☆198 · Updated last week
- MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvement… ☆400 · Updated this week
- Experimental GPU language with meta-programming ☆23 · Updated last year
- This is a repository for all workshop-related materials. ☆231 · Updated last year
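Several of the repositories above implement matrix multiplication in CUDA from scratch. As a point of reference for what those projects start from, here is a minimal, naive SGEMM kernel (one thread per output element). It is a generic sketch, not code from any of the listed projects; the kernel name, matrix shapes, and launch configuration are illustrative assumptions.

```cuda
// Naive SGEMM sketch: one thread computes one element of C = A * B,
// with A (M x K), B (K x N), C (M x N), all row-major in global memory.
__global__ void naive_sgemm(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            acc += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = acc;
    }
}

// Example launch (assumed 16x16 thread blocks covering C):
// dim3 block(16, 16);
// dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
// naive_sgemm<<<grid, block>>>(dA, dB, dC, M, N, K);
```

The optimized SGEMM projects listed above improve on this baseline with techniques such as shared-memory tiling, register blocking, and vectorized loads.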