andylolu2 / cuda-mnist
Training MLP on MNIST in 1.5 seconds with pure CUDA
☆46 · Updated last year
Alternatives and similar repositories for cuda-mnist
Users interested in cuda-mnist are comparing it to the repositories listed below.
- An implementation of the transformer architecture as an Nvidia CUDA kernel ☆195 · Updated 2 years ago
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ☆151 · Updated 2 years ago
- A stand-alone implementation of several NumPy dtype extensions used in machine learning. ☆312 · Updated this week
- A plugin for Jupyter Notebook to run CUDA C/C++ code ☆255 · Updated last year
- Accelerated general (FP32) matrix multiplication from scratch in CUDA ☆169 · Updated 10 months ago
- Documented and unit-tested educational deep learning framework with autograd from scratch. ☆122 · Updated last year
- ☆177 · Updated last year
- Neural network from scratch in CUDA/C++ ☆87 · Updated 2 months ago
- Custom kernels in the Triton language for accelerating LLMs ☆27 · Updated last year
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆61 · Updated last week
- High-performance SGEMM on CUDA devices ☆112 · Updated 10 months ago
- Implementation of Flash Attention in JAX ☆222 · Updated last year
- FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores ☆333 · Updated 11 months ago
- Official problem sets / reference kernels for the GPU MODE leaderboard! ☆160 · Updated 2 weeks ago
- ML/DL math and method notes ☆64 · Updated last year
- An open-source, efficient deep learning framework/compiler, written in Python. ☆736 · Updated 2 months ago
- Quantized LLM training in pure CUDA/C++. ☆218 · Updated this week
- A JAX-based library for building transformers; includes implementations of GPT, Gemma, LLaMA, Mixtral, Whisper, Swin, ViT and more. ☆297 · Updated last year
- NVIDIA tools guide ☆149 · Updated 10 months ago
- MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvement… ☆401 · Updated this week
- Context manager to profile the forward and backward times of PyTorch's nn.Module ☆83 · Updated 2 years ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆583 · Updated 3 months ago
- Multi-threaded FP32 matrix multiplication on x86 CPUs ☆367 · Updated 7 months ago
- A simple yet fast implementation of matrix multiplication in CUDA. ☆39 · Updated last year
- Step-by-step implementation of a fast softmax kernel in CUDA ☆57 · Updated 10 months ago
- ☆90 · Updated last year
- Fast low-bit matmul kernels in Triton ☆401 · Updated last week
- Implementation of a Transformer, but completely in Triton ☆277 · Updated 3 years ago
- The Triton backend for PyTorch TorchScript models. ☆165 · Updated last week
- Nvidia-contributed CUDA tutorial for Numba ☆262 · Updated 3 years ago