CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ CUDA. The authors introduce each area of CUDA development through working examples.
☆476Jun 30, 2023Updated 2 years ago
Alternatives and similar repositories for CUDA-by-Example-source-code-for-the-book-s-examples-
Users that are interested in CUDA-by-Example-source-code-for-the-book-s-examples- are comparing it to the libraries listed below
- ☆485Jul 5, 2015Updated 10 years ago
- Learn CUDA Programming, published by Packt☆1,231Dec 30, 2023Updated 2 years ago
- Samples for CUDA developers that demonstrate features in the CUDA Toolkit☆8,870Jan 6, 2026Updated last month
- A CUDA runtime environment based on the CUDA Driver API☆15Jul 30, 2025Updated 7 months ago
- CUDA Library Samples☆2,324Feb 21, 2026Updated last week
- The CMake version of cuda_by_example☆148Jul 24, 2020Updated 5 years ago
- Sample codes for my CUDA programming book☆2,010Dec 14, 2025Updated 2 months ago
- How to optimize some algorithms in CUDA.☆2,825Feb 15, 2026Updated 2 weeks ago
- Source code that accompanies The CUDA Handbook.☆568Oct 7, 2025Updated 4 months ago
- An MLIR-based compiler from C/C++ to AMD-Xilinx Versal AIE☆18Aug 5, 2022Updated 3 years ago
- ☆2,698Jan 16, 2024Updated 2 years ago
- CUDA Templates and Python DSLs for High-Performance Linear Algebra☆9,315Updated this week
- Transformer related optimization, including BERT, GPT☆6,394Mar 27, 2024Updated last year
- This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several…☆1,244Jul 29, 2023Updated 2 years ago
- Source code examples from the Parallel Forall Blog☆1,322Sep 23, 2025Updated 5 months ago
- TensorRT-in-Action is a GitHub repository that provides code examples for using TensorRT, with corresponding Jupyter Notebooks.☆15Jun 1, 2023Updated 2 years ago
- A simple high performance CUDA GEMM implementation.☆426Jan 4, 2024Updated 2 years ago
- 📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉☆9,755Updated this week
- CUDA SGEMM optimization note☆15Oct 31, 2023Updated 2 years ago
- Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.☆72Sep 8, 2024Updated last year
- Step-by-step optimization of CUDA SGEMM☆432Mar 30, 2022Updated 3 years ago
- Triton documentation in Simplified Chinese☆105Dec 17, 2025Updated 2 months ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆32Apr 2, 2025Updated 11 months ago
- ☆16Apr 28, 2023Updated 2 years ago
- Material for gpu-mode lectures☆5,773Feb 1, 2026Updated last month
- flash attention tutorial written in python, triton, cuda, cutlass☆488Jan 20, 2026Updated last month
- Several simple examples for popular neural network toolkits calling custom CUDA operators.☆1,526Apr 29, 2021Updated 4 years ago
- This repository contains the results and code for the MLPerf™ Training v2.1 benchmark.☆15Aug 9, 2023Updated 2 years ago
- Deferred Continuous Batching in Resource-Efficient Large Language Model Serving (EuroMLSys 2024)☆19May 28, 2024Updated last year
- Development repository for the Triton language and compiler☆18,501Updated this week
- ☆53Updated this week
- A layered, decoupled deep learning inference engine☆79Feb 17, 2025Updated last year
- Introduction to Parallel Programming class code☆1,345Jun 27, 2022Updated 3 years ago
- extensible collectives library in triton☆95Mar 31, 2025Updated 11 months ago
- CUDA Core Compute Libraries☆2,182Updated this week
- Simple samples for TensorRT programming☆1,658Jan 22, 2026Updated last month
- Benchmarking PyTorch 2.0 different models☆20Mar 19, 2023Updated 2 years ago
- This is a Chinese translation of the CUDA programming guide☆1,877Nov 13, 2024Updated last year
- row-major matmul optimization☆703Updated this week