apoorvnandan / lilgrad
PyTorch from scratch in pure C/CUDA and Python
☆41Updated last year
Alternatives and similar repositories for lilgrad
Users who are interested in lilgrad are comparing it to the libraries listed below.
- Learning about CUDA by writing PTX code.☆150Updated last year
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7)☆66Updated 8 months ago
- Quantized LLM training in pure CUDA/C++.☆221Updated this week
- ☆81Updated this week
- Multi-Threaded FP32 Matrix Multiplication on x86 CPUs☆370Updated 8 months ago
- Learnings and programs related to CUDA☆428Updated 5 months ago
- small auto-grad engine inspired by Karpathy's micrograd and PyTorch☆277Updated last year
- (WIP) A small but powerful, homemade PyTorch from scratch.☆662Updated this week
- Alex Krizhevsky's original code from Google Code☆197Updated 9 years ago
- My submission for the GPUMODE/AMD fp8 mm challenge☆29Updated 6 months ago
- in this repository, i'm going to implement increasingly complex llm inference optimizations☆75Updated 6 months ago
- Accelerated General (FP32) Matrix Multiplication from scratch in CUDA☆174Updated 11 months ago
- A really tiny autograd engine☆96Updated 6 months ago
- coding CUDA everyday!☆71Updated last week
- could we make an ml stack in 100,000 lines of code?☆46Updated last year
- ☆86Updated last month
- Andrej Karpathy's micrograd implemented in C☆30Updated last year
- An implementation of a deep learning framework and models in C☆48Updated 8 months ago
- GPT-2 in C☆77Updated 11 months ago
- Solve puzzles to improve your tinygrad skills!☆167Updated 2 months ago
- LLM training in simple, raw C/CUDA☆108Updated last year
- Custom PTX Instruction Benchmark☆136Updated 9 months ago
- High-Performance SGEMM on CUDA devices☆113Updated 10 months ago
- Following Karpathy with GPT-2 implementation and training, writing lots of comments because I have the memory of a goldfish☆172Updated last year
- SIMD quantization kernels☆93Updated 3 months ago
- A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.☆441Updated 9 months ago
- Simple byte pair encoding mechanism used for tokenization, written purely in C☆139Updated last year
- Notes on "Programming Massively Parallel Processors" by Hwu, Kirk, and Hajj (4th ed.)☆53Updated last year
- Simple MPI implementation for prototyping or learning☆292Updated 4 months ago
- A high-performance attention mechanism that computes softmax normalization in a single streaming pass using running accumulators (online …☆28Updated 2 months ago