linjames0 / Transformer-CUDA
An implementation of the transformer architecture as an Nvidia CUDA kernel
☆174 Updated last year
Alternatives and similar repositories for Transformer-CUDA:
Users who are interested in Transformer-CUDA compare it to the repositories listed below:
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆126 Updated last year
- ☆151 Updated last year
- Cataloging released Triton kernels. ☆204 Updated 2 months ago
- ☆191 Updated this week
- Solve puzzles. Learn CUDA. ☆63 Updated last year
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆524 Updated last month
- ☆136 Updated 2 months ago
- Collection of kernels written in Triton language ☆114 Updated last month
- Fastest kernels written from scratch ☆199 Updated 2 weeks ago
- Fast low-bit matmul kernels in Triton ☆267 Updated this week
- Applied AI experiments and examples for PyTorch ☆249 Updated this week
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems ☆234 Updated this week
- Learning about CUDA by writing PTX code. ☆124 Updated last year
- CUDA Matrix Multiplication Optimization ☆173 Updated 8 months ago
- This repository contains the experimental PyTorch native float8 training UX ☆222 Updated 7 months ago
- Alex Krizhevsky's original code from Google Code ☆190 Updated 9 years ago
- Experiment of using Tangent to autodiff triton ☆78 Updated last year
- The simplest but fast implementation of matrix multiplication in CUDA. ☆34 Updated 7 months ago
- ring-attention experiments ☆127 Updated 5 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆232 Updated 3 weeks ago
- Simple byte pair encoding mechanism used for tokenization, written purely in C ☆129 Updated 4 months ago
- A really tiny autograd engine ☆90 Updated 11 months ago
- ☆86 Updated last year
- ☆290 Updated this week
- Fast CUDA matrix multiplication from scratch ☆663 Updated last year
- Step-by-step optimization of CUDA SGEMM ☆293 Updated 2 years ago
- High-Performance SGEMM on CUDA devices ☆86 Updated 2 months ago