andylolu2 / cuda-mnist
Training MLP on MNIST in 1.5 seconds with pure CUDA
☆47 Updated 8 months ago
Alternatives and similar repositories for cuda-mnist
Users who are interested in cuda-mnist are comparing it to the libraries listed below.
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆136 Updated last year
- An implementation of the transformer architecture onto an Nvidia CUDA kernel ☆188 Updated last year
- A stand-alone implementation of several NumPy dtype extensions used in machine learning. ☆281 Updated 2 weeks ago
- ☆162 Updated last year
- Documented and Unit Tested educational Deep Learning framework with Autograd from scratch. ☆117 Updated last year
- High-Performance SGEMM on CUDA devices ☆98 Updated 6 months ago
- The simplest but fast implementation of matrix multiplication in CUDA. ☆37 Updated 11 months ago
- Implementation of Flash Attention in Jax ☆214 Updated last year
- A plugin for Jupyter Notebook to run CUDA C/C++ code ☆237 Updated 10 months ago
- Custom kernels in Triton language for accelerating LLMs ☆23 Updated last year
- ☆322 Updated 3 weeks ago
- ☆47 Updated 6 months ago
- FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores ☆323 Updated 6 months ago
- Simple MPI implementation for prototyping or learning ☆267 Updated this week
- LLM training in simple, raw C/CUDA ☆101 Updated last year
- Various transformers for FSDP research ☆37 Updated 2 years ago
- Notebooks for the "Deep Learning with JAX" book ☆151 Updated last month
- MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvement… ☆389 Updated last week
- ML/DL Math and Method notes ☆61 Updated last year
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆188 Updated last month
- ☆125 Updated last year
- Accelerated General (FP32) Matrix Multiplication from scratch in CUDA ☆120 Updated 6 months ago
- Multi-Threaded FP32 Matrix Multiplication on x86 CPUs ☆350 Updated 3 months ago
- NVIDIA tools guide ☆140 Updated 6 months ago
- Memory Optimizations for Deep Learning (ICML 2023) ☆102 Updated last year
- A Jax-based library for building transformers, includes implementations of GPT, Gemma, LlaMa, Mixtral, Whisper, SWin, ViT and more. ☆291 Updated 10 months ago
- Learn CUDA with PyTorch ☆32 Updated this week
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆47 Updated this week
- Tutorials for Triton, a language for writing gpu kernels ☆30 Updated last year
- This repository contains the experimental PyTorch native float8 training UX ☆224 Updated 11 months ago