pigirons/sgemm_hsw

This is an implementation of sgemm_kernel on L1d cache.

☆233

Alternatives and similar repositories for sgemm_hsw

Users that are interested in sgemm_hsw are comparing it to the libraries listed below.

pigirons / cpufp
A CPU tool for benchmarking the peak of floating points
☆586 · May 4, 2026 · Updated 2 months ago
pigirons / conv3x3_m1
This is a demo how to write a high performance convolution run on apple silicon
☆56 · Feb 8, 2022 · Updated 4 years ago
BBuf / how-to-optimize-gemm
☆99 · May 20, 2026 · Updated 2 months ago
flame / how-to-optimize-gemm
☆2,025 · Jul 29, 2023 · Updated 3 years ago
mit-han-lab / inter-operator-scheduler
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
☆201 · Apr 27, 2022 · Updated 4 years ago
flame / blislab
BLISlab: A Sandbox for Optimizing GEMM
☆572 · Jun 17, 2021 · Updated 5 years ago
cloudcores / CuAssembler
An unofficial cuda assembler, for all generations of SASS, hopefully ：）
☆610 · Apr 20, 2023 · Updated 3 years ago
StrongSpoon / tvm.schedule
examples for tvm schedule API
☆101 · Jun 12, 2023 · Updated 3 years ago
Yinghan-Li / YHs_Sample
Yinghan's Code Sample
☆365 · Jul 25, 2022 · Updated 4 years ago
tpoisonooo / how-to-optimize-gemm
row-major matmul optimization
☆744 · May 14, 2026 · Updated 2 months ago
pigirons / spmv
This is a tuned sparse matrix dense vector multiplication(SpMV) library
☆23 · Mar 21, 2016 · Updated 10 years ago
OpenPPL / CuAssembler
An unofficial cuda assembler, for all generations of SASS, hopefully ：）
☆85 · Mar 20, 2023 · Updated 3 years ago
OpenPPL / ppl.cv
ppl.cv is a high-performance image processing library of openPPL supporting various platforms.
☆515 · Oct 30, 2024 · Updated last year
OpenPPL / ppl.nn
A primitive library for neural network
☆1,367 · Nov 24, 2024 · Updated last year
OpenPPL / ppl.llm.kernel.cuda
☆150 · Jan 9, 2025 · Updated last year
JohndeVostok / APE
A GPU FP32 computation method with Tensor Cores.
☆27 · Dec 8, 2025 · Updated 7 months ago
yzhaiustc / Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.
☆420 · Jan 2, 2025 · Updated last year
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆113 · Sep 10, 2024 · Updated last year
microsoft / nnfusion
A flexible and efficient deep neural network (DNN) compiler that generates high-performance executable from a DNN model description.
☆1,002 · Sep 19, 2024 · Updated last year
microsoft / ConvStencil
☆37 · Apr 10, 2024 · Updated 2 years ago
BBuf / tvm_mlir_learn
compiler learning resources collect.
☆2,758 · May 20, 2026 · Updated 2 months ago
Ldpe2G / ArmNeonOptimization
Arm neon optimization practice
☆393 · Dec 22, 2020 · Updated 5 years ago
weishengying / cutlass_flash_atten_fp8
Implements FP8 flash attention on the Ada architecture using the cutlass repository
☆82 · Aug 12, 2024 · Updated last year
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆45 · Feb 27, 2025 · Updated last year
frozein / QuickMathHPP
a single-header math library
☆17 · Nov 7, 2025 · Updated 8 months ago
Liu-xiandong / How_to_optimize_in_GPU
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several…
☆1,336 · Jul 29, 2023 · Updated 3 years ago
BBuf / ArmNeonOptimization
arm-neon
☆94 · May 20, 2026 · Updated 2 months ago
huawei-noah / bolt
Bolt is a deep learning library with high performance and heterogeneous flexibility.
☆958 · Apr 11, 2025 · Updated last year
merrymercy / awesome-tensor-compilers
A list of awesome compiler projects and papers for tensor computation and deep learning.
☆2,771 · Oct 19, 2024 · Updated last year
njuhope / cuda_sgemm
☆121 · Apr 11, 2024 · Updated 2 years ago
flame / fmm-gen
Generating Families of Practical Fast Matrix Multiplication Algorithms
☆12 · Jul 7, 2017 · Updated 9 years ago
nihui / valgrind-android
☆62 · Dec 5, 2021 · Updated 4 years ago
Keenuts / vulkan-compute
related to virglrender-vulkan: basic compute test application
☆19 · Feb 12, 2026 · Updated 5 months ago
Cjkkkk / CUDA_gemm
A simple high performance CUDA GEMM implementation.
☆437 · Jan 4, 2024 · Updated 2 years ago
NVIDIA / cutlass
CUDA Templates and Python DSLs for High-Performance Linear Algebra
☆10,154 · Updated this week
galois-stack / galois
a tensor computing compiler based tile programming for gpu, cpu or tpu
☆45 · Feb 2, 2026 · Updated 5 months ago
Bruce-Lee-LY / cuda_hgemm
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…
☆559 · Sep 8, 2024 · Updated last year
NervanaSystems / maxas
Assembler for NVIDIA Maxwell architecture
☆1,074 · Jan 3, 2023 · Updated 3 years ago
lixiuhong / implicit_gemm_convolution
☆14 · May 28, 2019 · Updated 7 years ago