coderonion/awesome-cuda-and-hpc

🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.

☆501

Alternatives and similar repositories for awesome-cuda-and-hpc

Users who are interested in awesome-cuda-and-hpc are comparing it to the libraries listed below.

DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆281 · Jul 1, 2025 · Updated last year
xlite-dev / LeetCUDA
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
☆11,655 · Updated this week
66RING / tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
☆528 · Jan 20, 2026 · Updated 6 months ago
BBuf / how-to-optim-algorithm-in-cuda
how to optimize some algorithm in cuda.
☆3,152 · Updated this week
IST-DASLab / gemm-fp8
High Performance FP8 GEMM Kernels for SM89 and later GPUs.
☆21 · Jan 24, 2025 · Updated last year
caibucai22 / awesome-cuda
Awesome code, projects, books, etc. related to CUDA
☆38 · Jun 2, 2026 · Updated last month
xlite-dev / ffpa-attn
🤖FFPA: Extends FA-2/3 via Split-D for large headdims, 1.5x~6×↑🎉 vs SDPA, up to 513~535 TFLOPS🎉 on NVIDIA H200.
☆318 · Updated this week
leimao / CUTLASS-Examples
CUTLASS and CuTe Examples
☆137 · Nov 30, 2025 · Updated 7 months ago
flashinfer-ai / flashinfer
FlashInfer: Kernel Library for LLM Serving
☆6,053 · Updated this week
Dao-AILab / quack
A Quirky Assortment of CuTe Kernels
☆1,076 · Updated this week
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆243 · Jan 20, 2026 · Updated 6 months ago
gpu-mode / triton-index
Cataloging released Triton kernels.
☆310 · Sep 9, 2025 · Updated 10 months ago
KuangjuX / NVSHMEM-Tutorial
NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer
☆195 · Feb 11, 2026 · Updated 5 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆115 · Jun 28, 2025 · Updated last year
NVIDIA / cutlass
CUDA Templates and Python DSLs for High-Performance Linear Algebra
☆10,151 · Updated this week
NVIDIA / cccl
CUDA Core Compute Libraries
☆2,443 · Updated this week
ArthurinRUC / cutlass-notes
From Minimal GEMM to Everything
☆230 · Jul 9, 2026 · Updated 2 weeks ago
PaddleJitLab / CUDATutorial
A self-learning tutorial for CUDA High Performance Programming.
☆1,053 · Jan 14, 2026 · Updated 6 months ago
weishengying / cutlass_flash_atten_fp8
FP8 flash attention implemented on the Ada architecture using the CUTLASS library
☆82 · Aug 12, 2024 · Updated last year
tile-ai / tilelang
Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels
☆7,007 · Updated this week
MekkCyber / CutlassAcademy
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
☆268 · May 6, 2025 · Updated last year
NVIDIA / nvshmem
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com…
☆566 · Jul 20, 2026 · Updated last week
JJXiangJiaoJun / cutlass_gemv
GEMV implementation with CUTLASS
☆21 · Aug 21, 2025 · Updated 11 months ago
Erkaman / Awesome-CUDA
This is a list of useful libraries and resources for CUDA development.
☆622 · Oct 8, 2017 · Updated 8 years ago
Liu-xiandong / How_to_optimize_in_GPU
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several…
☆1,335 · Jul 29, 2023 · Updated 3 years ago
siboehm / SGEMM_CUDA
Fast CUDA matrix multiplication from scratch
☆1,265 · Sep 2, 2025 · Updated 10 months ago
microsoft / BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
☆769 · Aug 6, 2025 · Updated 11 months ago
Tongkaio / CUDA_Kernel_Samples
A guide to hand-writing CUDA operators and preparing for interviews
☆1,055 · Aug 23, 2025 · Updated 11 months ago
flagos-ai / FlagGems
FlagGems is an operator library for large language models implemented in the Triton Language.
☆1,057 · Updated this week
bytedance / flux
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
☆1,348 · Aug 28, 2025 · Updated 11 months ago
xlite-dev / Awesome-LLM-Inference
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
☆5,424 · Updated this week
sablin39 / tilelang-cuda-skills
Skills for writing tilelang and debugging with CUDA toolkits.
☆133 · May 20, 2026 · Updated 2 months ago
ByteDance-Seed / cudaLLM
☆149 · Aug 18, 2025 · Updated 11 months ago
tpoisonooo / how-to-optimize-gemm
row-major matmul optimization
☆744 · May 14, 2026 · Updated 2 months ago
pranjalssh / fast.cu
Fastest kernels written from scratch
☆587 · Sep 18, 2025 · Updated 10 months ago
deepseek-ai / DeepGEMM
DeepGEMM: clean and efficient BLAS kernel library on GPU
☆7,577 · Jul 20, 2026 · Updated last week
ColfaxResearch / cutlass-kernels
☆270 · Jul 11, 2024 · Updated 2 years ago
BBuf / tensorrt-llm-moe
☆34 · Feb 3, 2025 · Updated last year
yuninxia / awesome-gemm
📚 A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software
☆67 · Feb 23, 2025 · Updated last year