tpn / cuda-by-example
Code for NVIDIA's CUDA By Example Book.
☆48Updated 5 years ago
Alternatives and similar repositories for cuda-by-example
Users who are interested in cuda-by-example are comparing it to the libraries listed below:
- Standalone Flash Attention v2 kernel without libtorch dependency☆113Updated last year
- CUDA Matrix Multiplication Optimization☆252Updated last year
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak ⚡️ Performance.☆145Updated 8 months ago
- A set of hands-on tutorials for CUDA programming☆246Updated last year
- Six major CUDA parallel computing patterns: code and notes☆61Updated 5 years ago
- A Visual Studio Code extension for building and debugging CUDA applications.☆100Updated this week
- ☆178Updated 2 years ago
- SGEMM optimization with CUDA, step by step☆21Updated last year
- CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology. …☆469Updated 2 years ago
- 🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.☆246Updated last week
- Some CUDA design patterns and a bit of template magic for CUDA☆158Updated 2 years ago
- Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.☆71Updated last year
- ☆34Updated 11 months ago
- Tutorials for writing high-performance GPU operators in AI frameworks.☆135Updated 2 years ago
- Benchmark code for the "Online normalizer calculation for softmax" paper☆105Updated 7 years ago
- ☆21Updated 4 years ago
- A demo of how to write a high-performance convolution that runs on Apple silicon☆57Updated 3 years ago
- Step-by-step optimization of CUDA SGEMM☆424Updated 3 years ago
- A faster implementation of OpenCV-CUDA that uses OpenCV objects, and more!☆54Updated 2 months ago
- CUDA 8-bit Tensor Core Matrix Multiplication based on m16n16k16 WMMA API☆35Updated 2 years ago
- High-performance, light-weight C++ LLM and VLM Inference Software for Physical AI☆213Updated 3 weeks ago
- C++ implementations for various tokenizers (sentencepiece, tiktoken etc).☆47Updated last week
- Implement Neural Networks in CUDA from Scratch☆24Updated last year
- Awesome code, projects, books, etc. related to CUDA☆28Updated last month
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …☆192Updated last year
- CUDA Templates for Linear Algebra Subroutines☆101Updated last year
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.☆46Updated 7 months ago
- LLM training in simple, raw C/CUDA☆112Updated last year
- CPU Memory Compiler and Parallel programming☆26Updated last year
- study of cutlass☆22Updated last year