Bruce-Lee-LY / cuda_back2back_hgemm
Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with the MMA PTX instruction.
☆11Updated last year
Related projects
Alternatives and complementary repositories for cuda_back2back_hgemm
- ☆38Updated 4 years ago
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications.☆22Updated 3 weeks ago
- An extension library of WMMA API (Tensor Core API)☆83Updated 4 months ago
- A lightweight, Pythonic, frontend for MLIR☆80Updated last year
- ☆15Updated 5 years ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)☆114Updated 4 years ago
- GPU Performance Advisor☆63Updated 2 years ago
- TPP experimentation on MLIR for linear algebra☆111Updated 3 weeks ago
- ☆40Updated 3 years ago
- ☆48Updated 8 months ago
- ☆79Updated 6 months ago
- A language and compiler for irregular tensor programs.☆134Updated 6 months ago
- ☆128Updated this week
- Development repository for the Open Earth Compiler☆77Updated 3 years ago
- A Top-Down Profiler for GPU Applications☆13Updated 8 months ago
- Cavs: An Efficient Runtime System for Dynamic Neural Networks☆13Updated 4 years ago
- An IR for efficiently simulating distributed ML computation.☆25Updated 9 months ago
- A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators☆65Updated 10 months ago
- Dissecting NVIDIA GPU Architecture☆82Updated 2 years ago
- Experiments and prototypes associated with IREE or MLIR☆49Updated 3 months ago
- 🎃 GPU load-balancing library for regular and irregular computations.☆57Updated 4 months ago
- A Winograd Minimal Filter Implementation in CUDA☆23Updated 3 years ago
- Benchmarks to capture important workloads.☆28Updated 5 months ago
- Assembler for NVIDIA Volta and Turing GPUs☆200Updated 2 years ago
- A GPU performance prediction toolkit for CUDA programs☆16Updated 5 years ago
- SparseTIR: Sparse Tensor Compiler for Deep Learning☆131Updated last year
- CUDA Flux is a profiler for GPU applications that reports the basic block execution frequencies of compute kernels☆31Updated 3 years ago
- Unified compiler/runtime for interfacing with PyTorch Dynamo.☆95Updated this week
- Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA☆31Updated 4 years ago
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS☆17Updated 2 years ago