tpn / cuda-by-example
Code for NVIDIA's CUDA By Example Book.
☆40 · Updated 4 years ago
Related projects
Alternatives and complementary repositories for cuda-by-example
- Some CUDA design patterns and a bit of template magic for CUDA ☆146 · Updated last year
- Standalone Flash Attention v2 kernel without libtorch dependency ☆98 · Updated 2 months ago
- CUDA Matrix Multiplication Optimization ☆141 · Updated 4 months ago
- Implement Neural Networks in CUDA from Scratch ☆22 · Updated 6 months ago
- Examples from Programming in Parallel with CUDA ☆108 · Updated last year
- LLaMA INT4 CUDA inference with AWQ ☆48 · Updated 4 months ago
- ☆36 · Updated 2 weeks ago
- TVMScript kernel for deformable attention ☆24 · Updated 2 years ago
- Step-by-step optimization of CUDA SGEMM ☆243 · Updated 2 years ago
- ☆10 · Updated 3 years ago
- LLM training in simple, raw C/CUDA ☆87 · Updated 6 months ago
- ☆9 · Updated last month
- Study of CUTLASS ☆19 · Updated 2 weeks ago
- ☆32 · Updated last month
- The CMake version of cuda_by_example ☆145 · Updated 4 years ago
- Matrix Algebra on GPU and Multicore Architectures (MAGMA) source releases from http://icl.cs.utk.edu/magma/index.html ☆21 · Updated 9 years ago
- A set of hands-on tutorials for CUDA programming ☆194 · Updated 7 months ago
- TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles. ☆157 · Updated last week
- ☆144 · Updated last year
- CUDA Templates for Linear Algebra Subroutines ☆92 · Updated 7 months ago
- Code and notes for the six major CUDA parallel computing patterns ☆58 · Updated 4 years ago
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆29 · Updated 2 months ago
- ☆55 · Updated last year
- ☆169 · Updated 4 months ago
- Examples of CUDA implementations using CUTLASS CuTe ☆101 · Updated 2 weeks ago
- A faster implementation of OpenCV-CUDA that uses OpenCV objects, and more! ☆42 · Updated 3 weeks ago
- pytorch-profiler ☆50 · Updated last year
- Codebase associated with the PyTorch compiler tutorial ☆44 · Updated 5 years ago
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆85 · Updated 8 months ago
- ☆63 · Updated 2 years ago