tpn / cuda-by-example
Code for NVIDIA's CUDA By Example book.
☆48 · Updated 5 years ago
Alternatives and similar repositories for cuda-by-example
Users interested in cuda-by-example are comparing it to the repositories listed below.
- Standalone Flash Attention v2 kernel without libtorch dependency ☆112 · Updated last year
- A Visual Studio Code extension for building and debugging CUDA applications. ☆95 · Updated 3 weeks ago
- CUDA by Example, written by two senior members of the CUDA software platform team, shows programmers how to employ this new technology. … ☆462 · Updated 2 years ago
- Some CUDA design patterns and a bit of template magic for CUDA ☆157 · Updated 2 years ago
- Tutorials for writing high-performance GPU operators in AI frameworks. ☆132 · Updated 2 years ago
- ⚡️ Write HGEMM from scratch with Tensor Cores using the WMMA, MMA, and CuTe APIs, and achieve peak performance. ☆140 · Updated 7 months ago
- CUDA Matrix Multiplication Optimization ☆247 · Updated last year
- LLM training in simple, raw C/CUDA ☆108 · Updated last year
- ☆176 · Updated 2 years ago
- A set of hands-on tutorials for CUDA programming ☆243 · Updated last year
- Code and notes for six major CUDA parallel computing patterns ☆61 · Updated 5 years ago
- ☆33 · Updated 10 months ago
- Training material for Nsight developer tools ☆173 · Updated last year
- 🤖 FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑ 🎉 vs SDPA EA. ☆242 · Updated last month
- Examples from Programming in Parallel with CUDA ☆169 · Updated 2 years ago
- ☆62 · Updated 3 years ago
- Step-by-step optimization of CUDA SGEMM ☆416 · Updated 3 years ago
- The CMake version of cuda_by_example ☆149 · Updated 5 years ago
- CUDA by practice ☆132 · Updated 5 years ago
- Implement neural networks in CUDA from scratch ☆24 · Updated last year
- Study of CUTLASS ☆22 · Updated last year
- ☆480 · Updated 10 years ago
- SGEMM optimization with CUDA, step by step ☆21 · Updated last year
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Cores) ☆146 · Updated 5 years ago (see the WMMA sketch after this list)
- A demo of how to write a high-performance convolution that runs on Apple silicon ☆57 · Updated 3 years ago
- torch::deploy (multipy for non-torch uses) is a system that lets you get around the GIL problem by running multiple Python interpreters i… ☆182 · Updated last week
- CPU Memory Compiler and Parallel Programming ☆26 · Updated last year
- A faster implementation of OpenCV-CUDA that uses OpenCV objects, and more! ☆54 · Updated last month
- Benchmark code for the "Online normalizer calculation for softmax" paper ☆103 · Updated 7 years ago (see the online-softmax sketch after this list)
- Samples demonstrating how to use the Compute Sanitizer Tools and Public API ☆91 · Updated 2 years ago
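The WMMA entry above refers to NVIDIA's warp-level matrix API (nvcuda::wmma). As a minimal sketch, and not code from that repository, the kernel below has a single warp compute one 16x16x16 tile D = A*B + C with FP16 inputs and FP32 accumulation; it assumes a device of compute capability 7.0 or newer and a launch of exactly one warp.

```cuda
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// One warp computes a single 16x16 tile: D = A * B + C.
// A is row-major FP16, B is col-major FP16, C/D are row-major FP32.
// Launch with <<<1, 32>>> (one warp); compile with e.g. -arch=sm_70.
__global__ void wmma_16x16x16(const half* a, const half* b,
                              const float* c, float* d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);

    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // acc = a*b + acc
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```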
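The online-softmax benchmark above measures the single-pass normalizer from the "Online normalizer calculation for softmax" paper. Below is a minimal CPU reference sketch of that recurrence (my own illustration, not the benchmark code): a running maximum m and a running sum d are updated together in one pass, with d rescaled whenever m grows, so the separate max pass of the classic safe softmax is avoided.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>

// Online softmax normalizer: one pass over x maintains
//   m = max(x_0..x_i) and d = sum_j exp(x_j - m),
// rescaling d by exp(m_old - m_new) whenever the maximum increases.
// A second pass writes exp(x_i - m) / d.
void online_softmax(const std::vector<float>& x, std::vector<float>& y) {
    float m = -INFINITY;   // running maximum
    float d = 0.0f;        // running normalizer
    for (float xi : x) {
        float m_new = fmaxf(m, xi);
        d = d * expf(m - m_new) + expf(xi - m_new);
        m = m_new;
    }
    y.resize(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        y[i] = expf(x[i] - m) / d;
}

int main() {
    std::vector<float> x = {1.0f, 2.0f, 3.0f}, y;
    online_softmax(x, y);
    for (float v : y) printf("%f\n", v);   // ~0.0900, 0.2447, 0.6652
    return 0;
}
```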