coderonion / awesome-cuda-and-hpc
This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.
☆439 Updated 5 months ago
Alternatives and similar repositories for awesome-cuda-and-hpc
Users who are interested in awesome-cuda-and-hpc are comparing it to the repositories listed below.
- CUDA Matrix Multiplication Optimization ☆252 Updated last year
- A CUDA tutorial to make people learn CUDA program from 0 ☆266 Updated last year
- A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software ☆60 Updated 11 months ago
- This project is about convolution operator optimization on GPU, include GEMM based (Implicit GEMM) convolution. ☆43 Updated 4 months ago
- Personal homepage of the Advanced Compiler Lab ☆192 Updated 3 months ago
- Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance. ☆402 Updated last year
- ☆284 Updated last week
- A simple high performance CUDA GEMM implementation. ☆426 Updated 2 years ago
- Solution of Programming Massively Parallel Processors ☆49 Updated 2 years ago
- Personal Notes for Learning HPC & Parallel Computation [NO LONGER ADDING NEW CONTENT] ☆76 Updated 3 years ago
- ☆145 Updated last year
- A Easy-to-understand TensorOp Matmul Tutorial ☆405 Updated 3 weeks ago
- Examples of CUDA implementations by Cutlass CuTe ☆270 Updated 7 months ago
- Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct… ☆517 Updated last year
- CSV spreadsheets and other material for AI accelerator survey papers ☆189 Updated 2 months ago
- Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT] ☆322 Updated 3 years ago
- 200+ Tensor/CUDA Cores Kernels, ⚡flash-attn-mma, ⚡hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2). ☆62 Updated 9 months ago
- CUDA PTX-ISA Document, Chinese translation ☆49 Updated 4 months ago
- learning how CUDA works ☆369 Updated 10 months ago
- ☆157 Updated last year
- ☆144 Updated last year
- ☆70 Updated last year
- ☆26 Updated 5 months ago
- Implement custom operators in PyTorch with cuda/c++ ☆76 Updated 3 years ago
- A tutorial for CUDA&PyTorch ☆208 Updated last week
- ☆179 Updated 2 years ago
- FlagTree is a unified compiler supporting multiple AI chip backends for custom Deep Learning operations, which is forked from triton-lang… ☆197 Updated this week
- collection of benchmarks to measure basic GPU capabilities ☆484 Updated 3 months ago
- Stepwise optimizations of DGEMM on CPU, reaching performance faster than Intel MKL eventually, even under multithreading. ☆163 Updated 3 years ago
- PyTorch emulation library for Microscaling (MX)-compatible data formats ☆337 Updated 7 months ago