jundaf2 / eigenMHA
Forward and backward Attention DNN operators implemented with LibTorch, cuDNN, and Eigen.
☆29 Updated last year
Alternatives and similar repositories for eigenMHA:
Users who are interested in eigenMHA are comparing it to the libraries listed below.
- Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct… ☆345 Updated 5 months ago
- ☆58 Updated last month
- ☆98 Updated 2 months ago
- Yinghan's Code Sample ☆305 Updated 2 years ago
- Assembler for NVIDIA Volta and Turing GPUs ☆212 Updated 3 years ago
- Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance. ☆324 Updated last month
- A simple high performance CUDA GEMM implementation. ☆346 Updated last year
- ☆129 Updated last month
- An extension library of WMMA API (Tensor Core API) ☆88 Updated 7 months ago
- Step-by-step optimization of CUDA SGEMM ☆284 Updated 2 years ago
- An Easy-to-understand TensorOp Matmul Tutorial ☆316 Updated 5 months ago
- collection of benchmarks to measure basic GPU capabilities ☆296 Updated last week
- Examples of CUDA implementations by Cutlass CuTe ☆138 Updated 2 weeks ago
- A library of GPU kernels for sparse matrix operations. ☆255 Updated 4 years ago
- A Winograd Minimal Filter Implementation in CUDA ☆24 Updated 3 years ago
- This project is about convolution operator optimization on GPU, including GEMM-based (Implicit GEMM) convolution. ☆25 Updated last month
- Dissecting NVIDIA GPU Architecture ☆88 Updated 2 years ago
- how to design cpu gemm on x86 with avx256, that can beat openblas. ☆67 Updated 5 years ago
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆175 Updated 3 weeks ago
- ☆26 Updated 10 months ago
- ☆142 Updated last month
- play gemm with tvm ☆87 Updated last year
- ☆109 Updated 10 months ago
- ☆181 Updated 7 months ago
- ☆87 Updated 10 months ago
- Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core. ☆55 Updated 5 months ago
- ☆80 Updated last year
- MatMul Performance Benchmarks for a Single CPU Core comparing both hand engineered and codegen kernels. ☆127 Updated last year
- An intelligent matrix format designer for SpMV ☆8 Updated last year
- Shared Middle-Layer for Triton Compilation ☆226 Updated this week