yzhaiustc / Optimizing-SGEMV-on-NVIDIA-GPUs

An implementation of SGEMV with performance comparable to cuBLAS.

☆10

Alternatives and similar repositories for Optimizing-SGEMV-on-NVIDIA-GPUs

Users who are interested in Optimizing-SGEMV-on-NVIDIA-GPUs are comparing it to the libraries listed below.

wmmae / wmma_extension
An extension library for the WMMA API (Tensor Core API)
☆99 Updated last year
mmperf / mmperf
MatMul performance benchmarks for a single CPU core, comparing both hand-engineered and codegen kernels.
☆133 Updated last year
sunlex0717 / DissectingTensorCores
☆104 Updated last year
wzsh / wmma_tensorcore_sample
Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)
☆138 Updated 4 years ago
NVIDIA / online-softmax
Benchmark code for the "Online normalizer calculation for softmax" paper
☆95 Updated 6 years ago
Bruce-Lee-LY / cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
☆63 Updated 10 months ago
weishengying / cutlass_flash_atten_fp8
An fp8 flash attention implementation on the Ada architecture, built with the cutlass repository
☆73 Updated 11 months ago
reed-lau / cute-gemm
☆125 Updated 7 months ago
lixiuhong / batched_gemm
☆39 Updated 5 years ago
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆93 Updated this week
nicolaswilde / cuda-tensorcore-hgemm
☆148 Updated 6 months ago
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆184 Updated 5 months ago
OpenPPL / ppl.llm.kernel.cuda
☆149 Updated 6 months ago
ColfaxResearch / cfx-article-src
☆124 Updated 2 months ago
leimao / CUDA-GEMM-Optimization
CUDA Matrix Multiplication Optimization
☆202 Updated last year
CalebDu / Awesome-Cute
☆87 Updated 2 months ago
intel / xetla
☆62 Updated 7 months ago
daadaada / turingas
Assembler for NVIDIA Volta and Turing GPUs
☆224 Updated 3 years ago
LeiWang1999 / tvm_gpu_gemm
Experiments with GEMM using TVM
☆91 Updated last year
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆111 Updated 10 months ago
njuhope / cuda_sgemm
☆113 Updated last year
gty111 / GEMM_MMA
Optimize GEMM with Tensor Cores step by step
☆29 Updated last year
zeroine / cutlass-cute-sample
☆37 Updated last year
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆39 Updated 4 months ago
ColfaxResearch / cutlass-kernels
☆223 Updated last year
DD-DuDa / Cute-Learning
Examples of CUDA implementations using CUTLASS CuTe
☆206 Updated 2 weeks ago
RRZE-HPC / gpu-benches
A collection of benchmarks to measure basic GPU capabilities
☆393 Updated 5 months ago
pigirons / conv3x3_m1
A demo of how to write a high-performance convolution that runs on Apple silicon
☆54 Updated 3 years ago
xlite-dev / HGEMM
⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA and CuTe APIs, and achieve peak ⚡️ performance.
☆87 Updated 2 months ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup compared to the PyTorch baseline.
☆112 Updated last year