digital-nomad-cheng / ECE408_Applied_Parallel_Programming
CUDA solutions for the lab assignments in the UIUC-ECE408 Applied Parallel Programming course.
☆16Updated 2 years ago
Alternatives and similar repositories for ECE408_Applied_Parallel_Programming
Users who are interested in ECE408_Applied_Parallel_Programming are comparing it to the repositories listed below
- CUDA Matrix Multiplication Optimization☆235Updated last year
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)☆145Updated 5 years ago
- CUTLASS and CuTe Examples☆98Updated 3 weeks ago
- Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores.☆89Updated 2 years ago
- Step-by-step optimization of CUDA SGEMM☆390Updated 3 years ago
- ☆38Updated last year
- ☆109Updated last year
- An Easy-to-understand TensorOp Matmul Tutorial☆389Updated 3 weeks ago
- ☆243Updated last year
- Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T…☆75Updated 4 years ago
- ☆50Updated 6 years ago
- ☆33Updated last year
- A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores☆54Updated last year
- Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…☆490Updated last year
- Source code of the PPoPP '22 paper: "TileSpGEMM: A Tiled Algorithm for Parallel Sparse General Matrix-Matrix Multiplication on GPUs" by Y…☆42Updated last year
- Personal Notes for Learning HPC & Parallel Computation [Active Adding New Content]☆74Updated 3 years ago
- ☆156Updated 10 months ago
- ☆154Updated 6 months ago
- A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.☆61Updated 2 weeks ago
- Dissecting NVIDIA GPU Architecture☆109Updated 3 years ago
- Examples of CUDA implementations by Cutlass CuTe☆246Updated 4 months ago
- A simple high performance CUDA GEMM implementation.☆415Updated last year
- 📚 A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software☆56Updated 8 months ago
- ☆47Updated last year
- Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.☆387Updated 10 months ago
- ☆41Updated last year
- ☆131Updated 2 weeks ago
- SparseTIR: Sparse Tensor Compiler for Deep Learning☆138Updated 2 years ago
- FlashSparse significantly reduces the computation redundancy for unstructured sparsity (for SpMM and SDDMM) on Tensor Cores through a Swa…☆32Updated last month
- ☆123Updated 2 weeks ago