SzymonOzog / GPU_Programming
☆18 · Updated last week
Related projects
Alternatives and complementary repositories for GPU_Programming
- Triton implementation of GPT/LLAMA ☆15 · Updated 2 months ago
- LLM training in simple, raw C/CUDA ☆86 · Updated 6 months ago
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆103 · Updated last year
- Tensor library with autograd using only Rust's standard library ☆62 · Updated 4 months ago
- ☆133 · Updated 9 months ago
- ☆145 · Updated this week
- extensible collectives library in triton ☆65 · Updated last month
- Alex Krizhevsky's original code from Google Code ☆189 · Updated 8 years ago
- Cataloging released Triton kernels. ☆133 · Updated 2 months ago
- Learn CUDA with PyTorch ☆14 · Updated this week
- Collection of kernels written in Triton language ☆63 · Updated 2 weeks ago
- Accelerated First Order Parallel Associative Scan ☆162 · Updated 2 months ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆479 · Updated 2 weeks ago
- Simple byte pair encoding mechanism for tokenization, written purely in C ☆120 · Updated this week
- ☆82 · Updated 8 months ago
- Simple and fast low-bit matmul kernels in CUDA / Triton ☆140 · Updated this week
- tenstorrent kernel from twitch ☆27 · Updated 7 months ago
- Learning about CUDA by writing PTX code. ☆28 · Updated 8 months ago
- The simplest but fast implementation of matrix multiplication in CUDA. ☆33 · Updated 3 months ago
- Applied AI experiments and examples for PyTorch ☆160 · Updated last week
- Attention in SRAM on Tenstorrent Grayskull ☆29 · Updated 3 months ago
- Custom kernels in Triton language for accelerating LLMs ☆17 · Updated 7 months ago
- ☆32 · Updated 5 months ago
- This repository contains the experimental PyTorch native float8 training UX ☆211 · Updated 3 months ago
- seqax = sequence modeling + JAX ☆132 · Updated 3 months ago
- pytorch from scratch in pure C/CUDA and python ☆34 · Updated last month
- ☆266 · Updated this week
- A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser") ☆268 · Updated this week
- ☆197 · Updated 3 months ago
- Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization, in pure C. ☆21 · Updated 4 months ago