lucasdelimanogueira / PyNorch
Recreating PyTorch from scratch (C/C++, CUDA, NCCL and Python, with multi-GPU support and automatic differentiation!)
☆137 Updated 7 months ago
Alternatives and similar repositories for PyNorch:
Users who are interested in PyNorch are comparing it to the libraries listed below:
- ☆91 Updated 2 weeks ago
- A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code. ☆161 Updated this week
- An implementation of the transformer architecture onto an Nvidia CUDA kernel ☆167 Updated last year
- The Tensor (or Array) ☆418 Updated 5 months ago
- ring-attention experiments ☆116 Updated 3 months ago
- Cataloging released Triton kernels. ☆155 Updated last week
- Alex Krizhevsky's original code from Google Code ☆190 Updated 8 years ago
- ☆138 Updated 11 months ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆505 Updated 2 months ago
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆114 Updated last year
- A simple byte pair encoding mechanism for tokenization, written purely in C ☆122 Updated 2 months ago
- Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O ☆211 Updated this week
- From zero to hero CUDA for accelerating maths and machine learning on GPU. ☆175 Updated 5 months ago
- UNet diffusion model in pure CUDA ☆596 Updated 6 months ago
- ☆170 Updated this week
- ☆104 Updated last week
- LLaMA 2 implemented from scratch in PyTorch ☆277 Updated last year
- Minimalistic 4D-parallelism distributed training framework for educational purposes ☆644 Updated this week
- The simplest but fast implementation of matrix multiplication in CUDA. ☆34 Updated 5 months ago
- Slides, notes, and materials for the workshop ☆309 Updated 7 months ago
- LoRA and DoRA from Scratch Implementations ☆194 Updated 10 months ago
- The Autograd Engine ☆550 Updated 4 months ago
- Distributed training (multi-node) of a Transformer model ☆49 Updated 9 months ago
- FlashRNN - Fast RNN Kernels with I/O Awareness ☆69 Updated last month
- The Multilayer Perceptron Language Model ☆532 Updated 5 months ago
- Fastest kernels written from scratch ☆118 Updated last month
- Fast low-bit matmul kernels in Triton ☆187 Updated last week
- Notes on quantization in neural networks ☆63 Updated last year
- Inference Vision Transformer (ViT) in plain C/C++ with ggml ☆244 Updated 9 months ago
- Implementation of Diffusion Transformer (DiT) in JAX ☆261 Updated 7 months ago