andylolu2 / cuda-mnist
Training MLP on MNIST in 1.5 seconds with pure CUDA
☆46 · Updated 6 months ago
Alternatives and similar repositories for cuda-mnist
Users who are interested in cuda-mnist are comparing it to the repositories listed below
- High-Performance SGEMM on CUDA devices ☆91 · Updated 3 months ago
- ☆155 · Updated last year
- The simplest but fast implementation of matrix multiplication in CUDA. ☆35 · Updated 9 months ago
- ☆52 · Updated this week
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆131 · Updated last year
- An implementation of the transformer architecture onto an Nvidia CUDA kernel ☆181 · Updated last year
- Fastest kernels written from scratch ☆261 · Updated last month
- Tutorials for Triton, a language for writing gpu kernels ☆14 · Updated last year
- LLM training in simple, raw C/CUDA ☆94 · Updated last year
- A stand-alone implementation of several NumPy dtype extensions used in machine learning. ☆262 · Updated this week
- Qualcomm Cloud AI SDK (Platform and Apps) enable high performance deep learning inference on Qualcomm Cloud AI platforms delivering high … ☆60 · Updated 6 months ago
- Step-by-step optimization of CUDA SGEMM ☆315 · Updated 3 years ago
- Documented and Unit Tested educational Deep Learning framework with Autograd from scratch. ☆111 · Updated last year
- CUDA Matrix Multiplication Optimization ☆186 · Updated 9 months ago
- Learning about CUDA by writing PTX code. ☆129 · Updated last year
- An open-source efficient deep learning framework/compiler, written in python. ☆698 · Updated 2 months ago
- Mixed precision training from scratch with Tensors and CUDA ☆22 · Updated last year
- ☆32 · Updated 4 months ago
- ☆204 · Updated 2 weeks ago
- Reference Kernels for the Leaderboard ☆43 · Updated this week
- ☆88 · Updated last year
- Cataloging released Triton kernels. ☆220 · Updated 4 months ago
- Fast CUDA matrix multiplication from scratch ☆709 · Updated last year
- Learn CUDA with PyTorch ☆20 · Updated 3 months ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆536 · Updated this week
- Custom kernels in Triton language for accelerating LLMs ☆19 · Updated last year
- Fast low-bit matmul kernels in Triton ☆299 · Updated this week
- NVIDIA tools guide ☆132 · Updated 4 months ago
- Multi-Threaded FP32 Matrix Multiplication on x86 CPUs ☆349 · Updated 3 weeks ago
- This repository contains the experimental PyTorch native float8 training UX ☆224 · Updated 9 months ago