aryagxr / cudaLinks
coding CUDA every day!
☆69 Updated last week
Alternatives and similar repositories for cuda
Users who are interested in cuda are comparing it to the repositories listed below.
- Quantized LLM training in pure CUDA/C++. ☆215 Updated this week
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆242 Updated 6 months ago
- PTX-Tutorial Written Purely By AIs (Deep Research of Openai and Claude 3.7) ☆66 Updated 7 months ago
- Learn CUDA with PyTorch ☆104 Updated last week
- A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code. ☆428 Updated 8 months ago
- Learning about CUDA by writing PTX code. ☆147 Updated last year
- making the official triton tutorials actually comprehensible ☆61 Updated 2 months ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆195 Updated 5 months ago
- ☆123 Updated 3 weeks ago
- ☆216 Updated 10 months ago
- Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X ☆70 Updated last week
- Learnings and programs related to CUDA ☆426 Updated 4 months ago
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆140 Updated this week
- Step by step implementation of a fast softmax kernel in CUDA ☆54 Updated 10 months ago
- Fast low-bit matmul kernels in Triton ☆395 Updated 3 weeks ago
- ☆63 Updated this week
- Cataloging released Triton kernels. ☆265 Updated 2 months ago
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels ☆171 Updated this week
- FlexAttention based, minimal vllm-style inference engine for fast Gemma 2 inference. ☆302 Updated 2 weeks ago
- ring-attention experiments ☆155 Updated last year
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning ☆128 Updated last week
- My submission for the GPUMODE/AMD fp8 mm challenge ☆29 Updated 5 months ago
- An implementation of the transformer architecture onto an Nvidia CUDA kernel ☆195 Updated 2 years ago
- Simple MPI implementation for prototyping or learning ☆287 Updated 3 months ago
- ☆41 Updated 8 months ago
- ☆247 Updated this week
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆61 Updated this week
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆150 Updated 2 years ago
- ☆177 Updated last year
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆282 Updated this week