coderonion / awesome-cuda-and-hpc
This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.
☆410 Updated 3 months ago
Alternatives and similar repositories for awesome-cuda-and-hpc
Users who are interested in awesome-cuda-and-hpc are comparing it to the repositories listed below.
- CUDA Matrix Multiplication Optimization ☆239 Updated last year
- Personal homepage of the Advanced Compiler Lab ☆169 Updated last month
- CSV spreadsheets and other material for AI accelerator survey papers ☆182 Updated this week
- ☆145 Updated last year
- A CUDA tutorial for learning CUDA programming from scratch ☆260 Updated last year
- An Easy-to-understand TensorOp Matmul Tutorial ☆394 Updated last month
- A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software ☆58 Updated 9 months ago
- 200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2). ☆51 Updated 7 months ago
- A simple high performance CUDA GEMM implementation. ☆418 Updated last year
- learning how CUDA works ☆344 Updated 8 months ago
- ☆156 Updated 11 months ago
- Examples of CUDA implementations by Cutlass CuTe ☆251 Updated 4 months ago
- hands on model tuning with TVM and profile it on a Mac M1, x86 CPU, and GTX-1080 GPU. ☆50 Updated 2 years ago
- Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct… ☆501 Updated last year
- ☆274 Updated last month
- ONNXim is a fast cycle-level simulator that can model multi-core NPUs for DNN inference ☆168 Updated 9 months ago
- ☆176 Updated 2 years ago
- Hands-On Practical MLIR Tutorial ☆44 Updated 3 months ago
- This is the top-level repository for the Accel-Sim framework. ☆518 Updated this week
- Solution of Programming Massively Parallel Processors ☆50 Updated last year
- Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance. ☆394 Updated 10 months ago
- ☆144 Updated last year
- ☆70 Updated 10 months ago
- ☆139 Updated last week
- llm theoretical performance analysis tools and support params, flops, memory and latency analysis. ☆113 Updated 4 months ago
- PyTorch emulation library for Microscaling (MX)-compatible data formats ☆319 Updated 5 months ago
- Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT] ☆318 Updated 3 years ago
- ☆209 Updated last month
- collection of benchmarks to measure basic GPU capabilities ☆459 Updated last month
- Dissecting NVIDIA GPU Architecture ☆110 Updated 3 years ago