linjames0 / Transformer-CUDA
An implementation of the transformer architecture in Nvidia CUDA kernels.
☆157 · Updated last year
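The project's own kernel code is not reproduced on this page. As a rough, hypothetical illustration of the kind of operation such a project implements, the sketch below shows a single CUDA kernel that computes unnormalized scaled dot-product attention scores S = QKᵀ / √d for one head; the kernel name and parameters (`attention_scores`, `seq_len`, `head_dim`) are assumptions, not identifiers from the repository.

```cuda
// Minimal sketch (not taken from Transformer-CUDA): one thread per (query, key)
// pair computes a single entry of the attention score matrix S = Q * K^T / sqrt(d).
#include <cuda_runtime.h>

__global__ void attention_scores(const float* Q,   // [seq_len, head_dim]
                                 const float* K,   // [seq_len, head_dim]
                                 float* S,         // [seq_len, seq_len] output
                                 int seq_len, int head_dim) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // query index
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // key index
    if (row < seq_len && col < seq_len) {
        float acc = 0.0f;
        for (int k = 0; k < head_dim; ++k)
            acc += Q[row * head_dim + k] * K[col * head_dim + k];
        S[row * seq_len + col] = acc * rsqrtf((float)head_dim);  // scale by 1/sqrt(d)
    }
}
```

A full implementation would follow this with a row-wise softmax and a second matmul against V, typically fused or tiled for memory efficiency.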
Related projects
Alternatives and complementary repositories for Transformer-CUDA
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ☆107 · Updated last year
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆483 · Updated 3 weeks ago
- Solve puzzles. Learn CUDA. ☆61 · Updated 11 months ago
- Cataloging released Triton kernels. ☆134 · Updated 2 months ago
- This repository contains the experimental PyTorch native float8 training UX. ☆211 · Updated 3 months ago
- Applied AI experiments and examples for PyTorch. ☆166 · Updated 2 weeks ago
- Simple and fast low-bit matmul kernels in CUDA / Triton. ☆143 · Updated this week
- Mixed precision training from scratch with Tensors and CUDA. ☆20 · Updated 6 months ago
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees". ☆350 · Updated 8 months ago
- A really tiny autograd engine. ☆87 · Updated 7 months ago
- Flash Attention in ~100 lines of CUDA (forward pass only). ☆626 · Updated 7 months ago
- Ring-attention experiments. ☆97 · Updated last month
- Puzzles for exploring transformers. ☆325 · Updated last year
- Alex Krizhevsky's original code from Google Code. ☆190 · Updated 8 years ago
- An experiment using Tangent to autodiff Triton. ☆72 · Updated 9 months ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆193 · Updated this week
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs. ☆187 · Updated this week
- Learning about CUDA by writing PTX code. ☆28 · Updated 8 months ago
- Simple Transformer in Jax. ☆119 · Updated 4 months ago
- Implementation of Flash Attention in Jax. ☆196 · Updated 8 months ago
- JAX implementation of the Llama 2 model. ☆210 · Updated 9 months ago
- Extensible collectives library in Triton. ☆71 · Updated last month
- An interactive exploration of Transformer programming. ☆246 · Updated last year
- Helpful tools and examples for working with flex-attention. ☆469 · Updated 3 weeks ago
- Annotated version of the Mamba paper. ☆457 · Updated 8 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models". ☆262 · Updated last year
- seqax = sequence modeling + JAX. ☆133 · Updated 4 months ago