andylolu2 / cuda-mnist
Training MLP on MNIST in 1.5 seconds with pure CUDA
☆44 · Updated 3 months ago
Alternatives and similar repositories for cuda-mnist:
Users interested in cuda-mnist are comparing it to the repositories listed below:
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ☆118 · Updated last year
- LLM training in simple, raw C/CUDA ☆91 · Updated 9 months ago
- An implementation of the transformer architecture in hand-written Nvidia CUDA kernels ☆169 · Updated last year
- A stand-alone implementation of several NumPy dtype extensions used in machine learning. ☆252 · Updated 2 weeks ago
- High-performance SGEMM on CUDA devices ☆74 · Updated 3 weeks ago
- Fastest kernels written from scratch ☆170 · Updated this week
- Cataloging released Triton kernels. ☆168 · Updated last month
- A minimal but fast implementation of matrix multiplication in CUDA. ☆34 · Updated 6 months ago
- Implementation of Flash Attention in JAX ☆204 · Updated 11 months ago
- ☆142 · Updated last year
- This repository contains the experimental PyTorch-native float8 training UX ☆221 · Updated 6 months ago
- Fast low-bit matmul kernels in Triton ☆236 · Updated this week
- CUDA matrix multiplication optimization ☆161 · Updated 7 months ago
- NVIDIA tools guide ☆102 · Updated last month
- ☆284 · Updated last week
- A fusion code generator for NVIDIA GPUs (commonly known as "nvFuser") ☆303 · Updated this week
- High-performance FP32 matrix multiplication on CPU ☆333 · Updated this week
- ☆86 · Updated 11 months ago
- UNet diffusion model in pure CUDA ☆599 · Updated 7 months ago
- A user-friendly toolchain that enables seamless execution of ONNX models using JAX as the backend. ☆107 · Updated 3 weeks ago
- ☆369 · Updated 7 months ago
- An open-source, efficient deep learning framework/compiler, written in Python. ☆681 · Updated last week
- CUDA learning guide ☆326 · Updated 8 months ago
- Collection of kernels written in the Triton language ☆105 · Updated this week
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆18 · Updated this week
- ☆179 · Updated last week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆514 · Updated this week
- Alex Krizhevsky's original code from Google Code ☆189 · Updated 8 years ago
- Custom kernels in the Triton language for accelerating LLMs ☆17 · Updated 10 months ago
- A C/C++ implementation of micrograd: a tiny autograd engine with a neural net on top. ☆63 · Updated last year