andylolu2 / cuda-mnist
Training MLP on MNIST in 1.5 seconds with pure CUDA
☆44 · Updated 5 months ago
Alternatives and similar repositories for cuda-mnist:
Users who are interested in cuda-mnist are comparing it to the libraries listed below.
- High-Performance SGEMM on CUDA devices ☆90 · Updated 3 months ago
- Custom kernels in Triton language for accelerating LLMs ☆18 · Updated last year
- A stand-alone implementation of several NumPy dtype extensions used in machine learning. ☆257 · Updated 3 weeks ago
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆131 · Updated last year
- The simplest but fast implementation of matrix multiplication in CUDA. ☆34 · Updated 8 months ago
- An implementation of the transformer architecture onto an Nvidia CUDA kernel ☆179 · Updated last year
- Fast low-bit matmul kernels in Triton ☆291 · Updated this week
- Fastest kernels written from scratch ☆236 · Updated 2 weeks ago
- NVIDIA tools guide ☆127 · Updated 3 months ago
- ☆31 · Updated 3 months ago
- ☆200 · Updated this week
- Cataloging released Triton kernels. ☆217 · Updated 3 months ago