DefTruth / hgemm-tensorcores-mma

⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.

☆52

Alternatives and similar repositories for hgemm-tensorcores-mma:

Users that are interested in hgemm-tensorcores-mma are comparing it to the libraries listed below

luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆69Updated 2 months ago
weishengying / cutlass_flash_atten_fp8
使用 cutlass 仓库在 ada 架构上实现 fp8 的 flash attention
☆53Updated 6 months ago
CalebDu / Awesome-Cute
☆36Updated last month
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆175Updated 3 weeks ago
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆104Updated 5 months ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆34Updated 2 months ago
AlibabaPAI / FLASHNN
☆81Updated 5 months ago
microsoft / TileFusion
TileFusion is a highly efficient kernel template library designed to elevate the level of abstraction in CUDA C for processing tiles.
☆56Updated this week
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆88Updated 11 months ago
Bruce-Lee-LY / cuda_hgemv
Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.
☆55Updated 5 months ago
yifuwang / symm-mem-recipes
☆44Updated last month
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆138Updated 2 weeks ago
ColfaxResearch / cfx-article-src
☆72Updated 2 months ago
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆102Updated last month
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for multi head attention (MHA) using CUDA core for the decoding stage of LLM inference.
☆29Updated 3 months ago
alibaba / easydist
Automated Parallelization System and Infrastructure for Multiple Ecosystems
☆78Updated 3 months ago
Bruce-Lee-LY / cutlass_gemm
Multiple GEMM operators are constructed with cutlass to support LLM inference.
☆16Updated 4 months ago
LeiWang1999 / Stream-k.tvm
☆19Updated 4 months ago
reed-lau / cute-gemm
☆98Updated 2 months ago
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆35Updated 5 months ago
zeroine / cutlass-cute-sample
☆26Updated 10 months ago
LeiWang1999 / TVM.CMakeExtend
Tutorials of Extending and importing TVM with CMAKE Include dependency.
☆13Updated 4 months ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆97Updated 7 months ago
DefTruth / ffpa-attn-mma
📚FFPA: Yet another Faster Flash Prefill Attention with O(1)⚡️SRAM complexity for headdim > 256, 1.8x~3x↑🎉faster than SDPA EA.
☆106Updated this week
heheda12345 / MagPy
☆38Updated 8 months ago
ColfaxResearch / cutlass-kernels
☆181Updated 7 months ago
LeiWang1999 / tvm_gpu_gemm
play gemm with tvm
☆87Updated last year
INT-FlashAttention2024 / INT-FlashAttention
☆61Updated 3 weeks ago
MARD1NO / CUDA-PPT
☆80Updated last year
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆64Updated 3 months ago