SzymonOzog / GPU_Programming
☆19 Updated 3 weeks ago
Related projects
Alternatives and complementary repositories for GPU_Programming
- Triton implementation of GPT/LLAMA ☆16 Updated 2 months ago
- Alex Krizhevsky's original code from Google Code ☆190 Updated 8 years ago
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆107 Updated last year
- LLM training in simple, raw C/CUDA ☆86 Updated 6 months ago
- ☆133 Updated 9 months ago
- An implementation of the transformer architecture onto an Nvidia CUDA kernel ☆157 Updated last year
- Accelerated First Order Parallel Associative Scan ☆164 Updated 3 months ago
- Solve puzzles. Learn CUDA. ☆61 Updated 11 months ago
- ☆197 Updated 4 months ago
- Cataloging released Triton kernels. ☆134 Updated 2 months ago
- Mixed precision training from scratch with Tensors and CUDA ☆20 Updated 6 months ago
- UNet diffusion model in pure CUDA ☆584 Updated 4 months ago
- PyTorch from scratch in pure C/CUDA and Python ☆37 Updated last month
- Tensor library with autograd using only Rust's standard library ☆62 Updated 4 months ago
- Extensible collectives library in Triton ☆72 Updated last month
- ☆52 Updated 11 months ago
- ☆32 Updated 5 months ago
- Learning about CUDA by writing PTX code. ☆28 Updated 8 months ago
- Recreating PyTorch from scratch (C/C++, CUDA, NCCL and Python, with multi-GPU support and automatic differentiation!) ☆114 Updated 5 months ago
- Experiment of using Tangent to autodiff Triton ☆72 Updated 9 months ago
- ☆153 Updated this week
- The simplest but fast implementation of matrix multiplication in CUDA. ☆33 Updated 3 months ago
- JAX bindings for Flash Attention v2 ☆79 Updated 4 months ago
- Implementation of Diffusion Transformer (DiT) in JAX ☆252 Updated 5 months ago
- ☆82 Updated 8 months ago
- LLM training in simple, raw C/CUDA ☆17 Updated 6 months ago
- Deep learning library implemented from scratch in numpy. Mixtral, Mamba, LLaMA, GPT, ResNet, and other experiments. ☆48 Updated 7 months ago
- seqax = sequence modeling + JAX ☆133 Updated 4 months ago
- Collection of kernels written in Triton language ☆68 Updated 3 weeks ago
- This repository contains the experimental PyTorch native float8 training UX ☆211 Updated 3 months ago