Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch
☆942 · Jul 19, 2023 · Updated 2 years ago
Alternatives and similar repositories for cuda_programming
Users that are interested in cuda_programming are comparing it to the libraries listed below
- Implement Neural Networks in CUDA from Scratch ☆24 · May 17, 2024 · Updated last year
- Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/) ☆942 · Aug 19, 2024 · Updated last year
- Learn CUDA Programming, published by Packt ☆1,231 · Dec 30, 2023 · Updated 2 years ago
- Samples for CUDA Developers which demonstrate features in CUDA Toolkit ☆8,870 · Jan 6, 2026 · Updated last month
- Fast CUDA matrix multiplication from scratch ☆1,060 · Sep 2, 2025 · Updated 5 months ago
- ☆3,316 · Feb 7, 2026 · Updated 3 weeks ago
- Step-by-step optimization of CUDA SGEMM ☆432 · Mar 30, 2022 · Updated 3 years ago
- CUDA Library Samples ☆2,324 · Feb 21, 2026 · Updated last week
- GPU programming related news and material links ☆1,997 · Sep 17, 2025 · Updated 5 months ago
- CUDA Templates and Python DSLs for High-Performance Linear Algebra ☆9,315 · Updated this week
- Material for gpu-mode lectures ☆5,773 · Feb 1, 2026 · Updated last month
- Sample codes for my CUDA programming book ☆2,010 · Dec 14, 2025 · Updated 2 months ago
- SOTA results for reid baseline model (Gluon implementation) ☆13 · Aug 6, 2018 · Updated 7 years ago
- Development repository for the Triton language and compiler ☆18,501 · Updated this week
- How to optimize some algorithms in CUDA. ☆2,825 · Feb 15, 2026 · Updated 2 weeks ago
- Simple pytest examples ☆14 · Jan 4, 2025 · Updated last year
- Standalone Flash Attention v2 kernel without libtorch dependency ☆114 · Sep 10, 2024 · Updated last year
- ☆177 · Feb 3, 2024 · Updated 2 years ago
- This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several… ☆1,244 · Jul 29, 2023 · Updated 2 years ago
- ☆485 · Jul 5, 2015 · Updated 10 years ago
- A CUDA tutorial for learning CUDA programming from zero ☆267 · Jul 9, 2024 · Updated last year
- A simple high performance CUDA GEMM implementation. ☆426 · Jan 4, 2024 · Updated 2 years ago
- CUDA Core Compute Libraries ☆2,182 · Updated this week
- This repository contains lectures designed for an introduction to RISC-V and its capabilities. ☆10 · Sep 19, 2025 · Updated 5 months ago
- A set of hands-on tutorials for CUDA programming ☆247 · Apr 8, 2024 · Updated last year
- Tile primitives for speedy kernels ☆3,183 · Updated this week
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆73 · May 26, 2024 · Updated last year
- Flash Attention in ~100 lines of CUDA (forward pass only) ☆1,079 · Dec 30, 2024 · Updated last year
- ☆454 · Dec 18, 2025 · Updated 2 months ago
- Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct… ☆526 · Sep 8, 2024 · Updated last year
- C++ Implementation of PyTorch Tutorials for Everyone ☆2,127 · Aug 25, 2025 · Updated 6 months ago
- Advanced Formal Language Theory (263-5352-00L; Spring 2023) ☆10 · Feb 21, 2023 · Updated 3 years ago
- PyTorch implementation for PaLM: A Hybrid Parser and Language Model. ☆10 · Jan 7, 2020 · Updated 6 years ago
- My study notes and hands-on projects for CUDA-based GPU programming ☆10 · Dec 11, 2025 · Updated 2 months ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆144 · Aug 18, 2020 · Updated 5 years ago
- Lightning fast C++/CUDA neural network framework ☆4,418 · Dec 14, 2025 · Updated 2 months ago
- Fastest kernels written from scratch ☆548 · Sep 18, 2025 · Updated 5 months ago
- This is a list of useful libraries and resources for CUDA development. ☆603 · Oct 8, 2017 · Updated 8 years ago
- Collection of benchmarks to measure basic GPU capabilities ☆497 · Oct 24, 2025 · Updated 4 months ago