gau-nernst / learn-cuda
Learn CUDA with PyTorch
☆176 · Updated 3 weeks ago
Alternatives and similar repositories for learn-cuda
Users interested in learn-cuda are comparing it to the libraries listed below.
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆247 · Updated 8 months ago
- Cataloging released Triton kernels. ☆282 · Updated 4 months ago
- Fast low-bit matmul kernels in Triton ☆423 · Updated 3 weeks ago
- A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code. ☆453 · Updated 10 months ago
- ☆271 · Updated this week
- ☆233 · Updated last year
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆188 · Updated 3 weeks ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆195 · Updated 7 months ago
- Fastest kernels written from scratch ☆523 · Updated 3 months ago
- Applied AI experiments and examples for PyTorch ☆312 · Updated 4 months ago
- ring-attention experiments ☆161 · Updated last year
- a minimal cache manager for PagedAttention, on top of llama3. ☆130 · Updated last year
- coding CUDA everyday! ☆72 · Updated last month
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆154 · Updated 2 years ago
- Helpful kernel tutorials and examples for tile-based GPU programming ☆568 · Updated this week
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆310 · Updated this week
- Collection of kernels written in Triton language ☆174 · Updated 9 months ago
- ☆128 · Updated 2 months ago
- making the official triton tutorials actually comprehensible ☆93 · Updated 4 months ago
- kernels, of the mega variety ☆648 · Updated 3 months ago
- A Quirky Assortment of CuTe Kernels ☆749 · Updated this week
- Accelerating MoE with IO and Tile-aware Optimizations ☆542 · Updated this week
- Simple MPI implementation for prototyping or learning ☆299 · Updated 5 months ago
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate. ☆711 · Updated this week
- Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O ☆547 · Updated 4 months ago
- ☆100 · Updated last year
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels ☆185 · Updated this week
- PTX-Tutorial Written Purely By AIs (Deep Research of Openai and Claude 3.7) ☆66 · Updated 9 months ago
- TPU inference for vLLM, with unified JAX and PyTorch support. ☆213 · Updated this week
- Learning about CUDA by writing PTX code. ☆151 · Updated last year