lucasdelimanogueira / PyNorch
Recreating PyTorch from scratch (C/C++, CUDA and Python, with multi-GPU support and automatic differentiation!)
☆89Updated 3 months ago
Related projects:
- A simple byte pair encoding (BPE) mechanism for tokenization, written purely in C☆115Updated 2 months ago
- Alex Krizhevsky's original code from Google Code☆185Updated 8 years ago
- From zero to hero CUDA for accelerating maths and machine learning on GPU.☆167Updated last month
- An implementation of the transformer architecture as an NVIDIA CUDA kernel☆152Updated 11 months ago
- The Tensor (or Array)☆388Updated last month
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.☆452Updated last week
- An ML Systems Onboarding list☆491Updated last month
- Implementation of Diffusion Transformer (DiT) in JAX☆246Updated 3 months ago
- A C/C++ implementation of micrograd: a tiny autograd engine with a neural net on top.☆60Updated 11 months ago
- Documented and Unit Tested educational Deep Learning framework with Autograd from scratch.☆103Updated 5 months ago
- Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation: https://www.youtube.com/watch?v=vAmKB7iPkWw☆226Updated last month
- CUDA Learning guide☆203Updated 3 months ago
- UNet diffusion model in pure CUDA☆562Updated 2 months ago
- ring-attention experiments☆89Updated 5 months ago
- LLaMA 2 implemented from scratch in PyTorch☆216Updated 11 months ago
- ☆124Updated 7 months ago
- A really tiny autograd engine☆85Updated 5 months ago
- Cataloging released Triton kernels.☆111Updated 3 weeks ago
- Slides, notes, and materials for the workshop☆297Updated 3 months ago
- The Autograd Engine☆482Updated last week
- Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization, in pure C.☆21Updated 2 months ago
- The Multilayer Perceptron Language Model☆504Updated last month
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI.☆88Updated 11 months ago
- Inference of Mamba models in pure C☆176Updated 6 months ago
- LLM training in simple, raw C/CUDA☆79Updated 4 months ago
- Well-documented, unit-tested, type-checked, and formatted implementation of a vanilla transformer, for educational purposes.☆211Updated 5 months ago
- ☆124Updated last week
- Fast multi-threaded matrix multiplication in C☆164Updated 3 weeks ago
- ☆52Updated last week
- Mixed precision training from scratch with Tensors and CUDA☆18Updated 4 months ago