andylolu2 / cuda-mnist
Training MLP on MNIST in 1.5 seconds with pure CUDA
☆46 · Updated last year
Alternatives and similar repositories for cuda-mnist
Users interested in cuda-mnist are comparing it to the repositories listed below.
- An implementation of the transformer architecture as an Nvidia CUDA kernel ☆195 · Updated 2 years ago
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ☆151 · Updated 2 years ago
- A stand-alone implementation of several NumPy dtype extensions used in machine learning. ☆312 · Updated this week
- A plugin for Jupyter Notebook to run CUDA C/C++ code ☆255 · Updated last year
- Accelerated general (FP32) matrix multiplication from scratch in CUDA ☆169 · Updated 10 months ago
- Documented and unit-tested educational deep learning framework with autograd from scratch. ☆122 · Updated last year
- ☆177 · Updated last year
- Neural network from scratch in CUDA/C++ ☆87 · Updated 2 months ago
- Custom kernels in the Triton language for accelerating LLMs ☆27 · Updated last year
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆61 · Updated last week
- High-performance SGEMM on CUDA devices ☆112 · Updated 10 months ago
- Implementation of Flash Attention in JAX ☆222 · Updated last year
- FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores ☆333 · Updated 11 months ago
- Official problem sets / reference kernels for the GPU MODE leaderboard! ☆160 · Updated 2 weeks ago
- ML/DL math and method notes ☆64 · Updated last year
- An open-source, efficient deep learning framework/compiler, written in Python. ☆736 · Updated 2 months ago
- Quantized LLM training in pure CUDA/C++. ☆218 · Updated this week
- A JAX-based library for building transformers; includes implementations of GPT, Gemma, LLaMA, Mixtral, Whisper, Swin, ViT and more. ☆297 · Updated last year
- NVIDIA tools guide ☆149 · Updated 10 months ago
- MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvement… ☆401 · Updated this week
- Context manager to profile the forward and backward times of PyTorch's nn.Module ☆83 · Updated 2 years ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆583 · Updated 3 months ago
- Multi-threaded FP32 matrix multiplication on x86 CPUs ☆367 · Updated 7 months ago
- A simple yet fast implementation of matrix multiplication in CUDA. ☆39 · Updated last year
- Step-by-step implementation of a fast softmax kernel in CUDA ☆57 · Updated 10 months ago
- ☆90 · Updated last year
- Fast low-bit matmul kernels in Triton ☆401 · Updated last week
- Implementation of a Transformer, but completely in Triton ☆277 · Updated 3 years ago
- The Triton backend for PyTorch TorchScript models. ☆165 · Updated last week
- Nvidia-contributed CUDA tutorial for Numba ☆262 · Updated 3 years ago