gau-nernst / learn-cuda
Learn CUDA with PyTorch
★14 · Updated 2 weeks ago
Related projects
Alternatives and complementary repositories for learn-cuda
- Large-scale 4D parallelism pre-training for 🤗 transformers in Mixture of Experts *(still a work in progress)* ★80 · Updated 11 months ago
- ring-attention experiments ★97 · Updated last month
- A place to store reusable transformer components of my own creation or found on the interwebs ★44 · Updated 2 weeks ago
- Various transformers for FSDP research ★33 · Updated 2 years ago
- Collection of autoregressive model implementations ★67 · Updated this week
- Experiment in using Tangent to autodiff Triton ★72 · Updated 10 months ago
- Simple and fast low-bit matmul kernels in CUDA / Triton ★147 · Updated this week
- Cataloging released Triton kernels. ★138 · Updated 2 months ago
- Mixed precision training from scratch with Tensors and CUDA ★20 · Updated 6 months ago
- See https://github.com/cuda-mode/triton-index/ instead! ★11 · Updated 6 months ago
- ML/DL Math and Method notes ★57 · Updated 11 months ago
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ★66 · Updated 5 months ago
- An implementation of the Llama architecture, to instruct and delight ★21 · Updated 3 months ago
- Flash Attention Implementation with Multiple Backend Support and Sharding. This module provides a flexible implementation of Flash Attenti… ★18 · Updated last week
- Make Triton easier ★41 · Updated 5 months ago
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ★35 · Updated 4 months ago
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ★87 · Updated 3 months ago
- a minimal cache manager for PagedAttention, on top of llama3. ★46 · Updated 2 months ago
- extensible collectives library in Triton ★72 · Updated 2 months ago
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ★107 · Updated last year
- Google TPU optimizations for transformers models ★75 · Updated this week