ShlokVFX / 100-days-cuda
This repository documents my 100-day journey of learning and writing CUDA kernels.
☆12 Updated 3 weeks ago
Alternatives and similar repositories for 100-days-cuda
Users who are interested in 100-days-cuda are comparing it to the repositories listed below.
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆196 Updated 2 months ago
- Cataloging released Triton kernels. ☆245 Updated 6 months ago
- Fastest kernels written from scratch ☆290 Updated 3 months ago
- A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code. ☆378 Updated 4 months ago
- Applied AI experiments and examples for PyTorch ☆281 Updated last month
- Fast low-bit matmul kernels in Triton ☆330 Updated last week
- 100 days of building GPU kernels! ☆462 Updated 2 months ago
- Perplexity GPU Kernels ☆395 Updated last month
- ☆179 Updated 6 months ago
- ☆216 Updated last year
- ☆225 Updated this week
- A curated list of awesome projects and papers for distributed training or inference ☆238 Updated 9 months ago
- ☆124 Updated 2 months ago
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆401 Updated last month
- CUDA Matrix Multiplication Optimization ☆202 Updated 11 months ago
- ☆110 Updated 4 months ago
- Materials for learning SGLang ☆481 Updated last week
- CUTLASS and CuTe Examples ☆63 Updated this week
- GPU Kernels ☆188 Updated 2 months ago
- ☆84 Updated 2 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆184 Updated this week
- Collection of kernels written in Triton language ☆136 Updated 3 months ago
- NVIDIA tools guide ☆138 Updated 6 months ago
- ☆168 Updated 11 months ago
- QuickReduce is a performant all-reduce library designed for AMD ROCm that supports inline compression. ☆30 Updated 4 months ago
- A lightweight design for computation-communication overlap. ☆146 Updated 3 weeks ago
- kernels, of the mega variety ☆441 Updated last month
- An Easy-to-understand TensorOp Matmul Tutorial ☆365 Updated 9 months ago
- Examples of CUDA implementations by Cutlass CuTe ☆206 Updated 2 weeks ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). ☆255 Updated 8 months ago