adam-maj / tiny-gpu
A minimal GPU design in Verilog to learn how GPUs work from the ground up
☆7,070 · Updated 2 months ago
Related projects
Alternatives and complementary repositories for tiny-gpu
- A lightweight library for portable low-level GPU computation using WebGPU. (☆3,746, updated last week)
- LLM training in simple, raw C/CUDA (☆24,356, updated last month)
- Material for gpu-mode lectures (☆2,967, updated this week)
- Video+code lecture on building nanoGPT from scratch (☆3,580, updated 2 months ago)
- Inference Llama 2 in one file of pure C (☆17,451, updated 3 months ago)
- Blazingly fast LLM inference. (☆4,418, updated this week)
- Implementation for MatMul-free LM. (☆2,918, updated last week)
- Solve puzzles. Learn CUDA. (☆9,861, updated 2 months ago)
- A nanoGPT pipeline packed in a spreadsheet (☆2,046, updated 4 months ago)
- Tile primitives for speedy kernels (☆1,643, updated this week)
- Run PyTorch LLMs locally on servers, desktop, and mobile (☆3,360, updated this week)
- Distributed LLM and StableDiffusion inference for mobile, desktop, and server. (☆2,610, updated 2 weeks ago)
- llama3 implementation one matrix multiplication at a time (☆13,684, updated 5 months ago)
- Solve puzzles. Improve your PyTorch. (☆3,267, updated 3 months ago)
- nanoGPT-style version of Llama 3.1 (☆1,231, updated 3 months ago)
- A tiny scalar-valued autograd engine and a neural net library on top of it with a PyTorch-like API (☆10,450, updated 3 months ago)
- The n-gram Language Model (☆1,337, updated 3 months ago)
- Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. (☆9,171, updated 4 months ago)
- High-speed Large Language Model serving on PCs with consumer-grade GPUs (☆7,955, updated 2 months ago)
- A native PyTorch library for large model training (☆2,586, updated last week)
- The official PyTorch implementation of Google's Gemma models (☆5,284, updated 3 months ago)
- PyTorch-native finetuning library (☆4,283, updated this week)
- Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python. (☆5,661, updated 3 weeks ago)
- Lightweight, standalone C++ inference engine for Google's Gemma models. (☆5,985, updated this week)
- A Python framework for high-performance GPU simulation and graphics (☆4,234, updated this week)
- From the Tensor to Stable Diffusion, a rough outline for a 9-week course. (☆1,023, updated 6 months ago)
- Efficient Triton kernels for LLM training (☆3,401, updated this week)
- LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve spee… (☆2,538, updated last month)
- 20+ high-performance LLMs with recipes to pretrain, finetune, and deploy at scale. (☆10,635, updated this week)