HamzaElshafie/h100_gemm

A series of high-performance GEMM (General Matrix Multiply) implementations, iteratively optimised for H100 GPUs in pure CUDA.

☆79

Alternatives and similar repositories for h100_gemm

Users that are interested in h100_gemm are comparing it to the libraries listed below.

mayankagarwals / MLSys-FlashLinfer-Contest
☆49 • Jul 14, 2026 • Updated 2 weeks ago
Libraries-Openly-Fused / FusedKernelLibrary
We aim to redefine Data Parallel libraries portability, performance, programmability and maintainability, by using C++ standard features, i…
☆53 • Updated this week
a-r-r-o-w / productionizing-diffusion
Optimizing diffusion for production-ready speeds
☆40 • Jan 10, 2026 • Updated 6 months ago
Klyne-org / Enigma-DSL
A Python DSL for Apple Metal GPU compute
☆25 • Updated this week
pgbreen / NVM
Deep neural network from Newton vs the Machine
☆16 • Oct 23, 2019 • Updated 6 years ago
sukoncon / TMA-Adaptive-FP8-Grouped-GEMM
☆27 • Aug 28, 2025 • Updated 11 months ago
danielvegamyhre / gemm
☆19 • Mar 29, 2026 • Updated 4 months ago
0xD4rky / nanotok
☆27 • Jun 7, 2026 • Updated last month
patrick-toulme / pyptx
A Python DSL to write Nvidia PTX for Hopper and Blackwell in JAX and PyTorch
☆369 • Jul 9, 2026 • Updated 3 weeks ago
pranjalssh / fast.cu
Fastest kernels written from scratch
☆587 • Sep 18, 2025 • Updated 10 months ago
aryagxr / cuda
coding CUDA everyday!
☆77 • Feb 5, 2026 • Updated 5 months ago
cudacourseh100 / H100-Course
Repository for the CUDA H100 Course
☆67 • Apr 12, 2026 • Updated 3 months ago
HamzaElshafie / gpt-oss-20B
A PyTorch implementation of the GPT-OSS-20B architecture. All components are coded from scratch: RoPE with YaRN, RMSNorm, SwiGLU with cla…
☆238 • Dec 2, 2025 • Updated 7 months ago
simveit / effective_transpose
Effective transpose on Hopper GPU
☆29 • Sep 6, 2025 • Updated 10 months ago
gau-nernst / gn-kernels
☆35 • Jun 28, 2026 • Updated last month
gautam1858 / tiny-gpu-compiler
An MLIR-based compiler that takes GPU kernels and compiles them to real hardware instructions. Interactive web visualizer included.
☆140 • Mar 21, 2026 • Updated 4 months ago
Wodlfvllf / QuintNet
QuintNet is a research-oriented PyTorch framework designed to explore and implement multi-dimensional parallelism strategies for distribu…
☆19 • Feb 3, 2026 • Updated 5 months ago
gpuasm / autosass
☆17 • Mar 29, 2026 • Updated 4 months ago
Dogacel / auto-gpu-kernel
Winner 🏆 (Agent-only) MLSys 2026 - FlashInfer AI Kernel Generation Contest for the DeepSeek Sparse Attention (DSA) track with an average…
☆148 • Jun 10, 2026 • Updated last month
RightNow-AI / tiny-tpu
Minimal TPU implementation with 8x8 systolic array and PyTorch integration
☆66 • Jan 26, 2026 • Updated 6 months ago
StigLidu / AdaExplore
The official implementation for paper "AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generat…
☆22 • Jul 12, 2026 • Updated 2 weeks ago
Chtholly-Boss / swizzle
A practical way of learning Swizzle
☆45 • Feb 3, 2025 • Updated last year
daniel-geon-park / triton_bwd
Automatic differentiation for Triton Kernels
☆29 • Aug 12, 2025 • Updated 11 months ago
Better-Call-Paul / blackwell_gemm
☆19 • Apr 26, 2026 • Updated 3 months ago
ademeure / cuda-side-boost
☆60 • Feb 24, 2026 • Updated 5 months ago
IST-DASLab / llmq
Quantized LLM training in pure CUDA/C++.
☆251 • Updated this week
computer-graphics-tools / ane
☆42 • Mar 5, 2026 • Updated 4 months ago
pierrel55 / llama_st
Load and run Llama from safetensors files in C
☆15 • Oct 24, 2024 • Updated last year
Infatoshi / physics-llm-inference
Companion code for The Physics of LLM Inference book
☆26 • Apr 21, 2026 • Updated 3 months ago
ArthurinRUC / cutlass-notes
From Minimal GEMM to Everything
☆230 • Jul 9, 2026 • Updated 2 weeks ago
gpusgobrr / explore-gemm
Exploring how optimizations for GEMMs work
☆36 • Feb 28, 2026 • Updated 5 months ago
vipuldivyanshu92 / ANEgpt
☆86 • Mar 3, 2026 • Updated 4 months ago
AutonomousOfficial / 8xGPUs
A hands-on guide for AI builders: make your own RTX 4090D/5090 GPU server that’s fast and efficient.
☆17 • Aug 19, 2025 • Updated 11 months ago
carsonpo / quadmul
a fast and customizable CUDA int4 tensor core gemm
☆15 • Aug 2, 2024 • Updated last year
ColfaxResearch / cfx-article-src
☆193 • May 7, 2025 • Updated last year
ColfaxResearch / layout-categories
This repository contains companion software for the Colfax Research paper "Categorical Foundations for CuTe Layouts".
☆140 • Sep 24, 2025 • Updated 10 months ago
gau-nernst / learn-cuda
Learn CUDA with PyTorch
☆360 • Jun 1, 2026 • Updated last month
hova88 / CUDA-MatMul-Practice
☆19 • Jan 4, 2024 • Updated 2 years ago
jmelo11 / quantsupport
A financial pricing library in Rust
☆19 • Jun 6, 2026 • Updated last month