Faraz9877/H100_GEMM

High-performance GEMM implementation optimized for NVIDIA H100 GPUs, leveraging the Hopper architecture's TMA, WGMMA, and Thread Block Clusters to approach theoretical peak performance.

☆11

Alternatives and similar repositories for H100_GEMM

Users interested in H100_GEMM are comparing it to the repositories listed below.

Ratbuyer / h100-features
☆18 · Mar 12, 2025 · Updated last year
ThomasVitale / llm-images
Catalog of OCI images for popular open-source or open Large Language Models.
☆16 · Jan 31, 2026 · Updated 5 months ago
SpRegTiling / sparse-register-tiling
☆10 · Mar 2, 2024 · Updated 2 years ago
OpenCilk / cilkrts
A copy of the Intel Cilk Plus runtime system with modifications to work with OpenCilk and its associated tools.
☆12 · Jan 20, 2021 · Updated 5 years ago
ggsharma / microgradpp
A header-only C++ autograd engine and neural network library inspired by Karpathy's micrograd. Learn backpropagation in modern C++17.
☆16 · Jan 14, 2026 · Updated 6 months ago
simveit / effective_transpose
Effective transpose on Hopper GPU
☆29 · Sep 6, 2025 · Updated 10 months ago
caps-tum / mt4g
Memory Topology for GPUs
☆19 · Jul 8, 2026 · Updated last week
HuyNguyen-hust / hopper-gemm-101
☆13 · Dec 22, 2024 · Updated last year
FlorianRhiem / VFRendering
A vector field rendering library
☆17 · Jul 31, 2019 · Updated 6 years ago
kasshu / SystemCallMock
A system call mock demonstration for gmock (C++)
☆15 · Jun 14, 2018 · Updated 8 years ago
JohndeVostok / APE
A GPU FP32 computation method with Tensor Cores.
☆27 · Dec 8, 2025 · Updated 7 months ago
Qwesh157 / conv_op_optimization
This project is about convolution operator optimization on GPU, including GEMM-based (implicit GEMM) convolution.
☆44 · Sep 29, 2025 · Updated 9 months ago
doingself / ARKitApp
ARKit demo
☆11 · Aug 20, 2018 · Updated 7 years ago
muriloboratto / NVSHEMEM
Sample codes using NVSHMEM on multi-GPU
☆30 · Jan 22, 2023 · Updated 3 years ago
SouhailHammou / Custom-VM
Virtual machine with a custom instruction set in C
☆16 · Jul 17, 2018 · Updated 8 years ago
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆68 · Mar 25, 2025 · Updated last year
basnijholt / variational-quantum-monte-carlo
2014: Variational Monte Carlo for the harmonic oscillator, helium, hydrogen and H2 - IPython notebook and FORTRAN90
☆13 · Jun 23, 2016 · Updated 10 years ago
zhuzilin / flash-attention-with-sink
☆37 · Aug 7, 2025 · Updated 11 months ago
BDAI-Research / DFLOP
☆17 · Apr 16, 2026 · Updated 3 months ago
hyhuang00 / moe_inference
Code Repository for the NeurIPS 2024 Paper "Toward Efficient Inference for Mixture of Experts".
☆19 · Oct 30, 2024 · Updated last year
ShaYeBuHui01 / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆15 · Aug 31, 2023 · Updated 2 years ago
gafert / Apate
A graphical and educational processor simulator based on the RISC-V instruction set architecture
☆11 · Apr 28, 2024 · Updated 2 years ago
CalebDu / Awesome-Cute
☆121 · May 16, 2025 · Updated last year
enp1s0 / cuMpSGEMM
Fast SGEMM emulation on Tensor Cores
☆17 · Feb 16, 2025 · Updated last year
pranjalssh / fast.cu
Fastest kernels written from scratch
☆583 · Sep 18, 2025 · Updated 10 months ago
ajtejankar / mixtral-vis-moe
Visualize expert firing frequencies across sentences in the Mixtral MoE model
☆18 · Dec 22, 2023 · Updated 2 years ago
RGivisiez / Heisenberg-SSE
Stochastic Series Expansion (SSE) for an isotropic S=1/2 antiferromagnetic quantum Heisenberg model in a 1D, 2D or 3D lattice. Every lattic…
☆15 · Jan 23, 2021 · Updated 5 years ago
NVlabs / mixedproxy
☆15 · Nov 14, 2023 · Updated 2 years ago
raywan-110 / AdaQP
Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training
☆24 · Mar 1, 2024 · Updated 2 years ago
YJHMITWEB / ExFlow
Explore Inter-layer Expert Affinity in MoE Model Inference
☆16 · May 6, 2024 · Updated 2 years ago
lucasew-graveyard / pocket2kindle
A thing to convert Pocket articles to Kindle books
☆15 · Apr 12, 2025 · Updated last year
naoyam / MemoryTracer-pintool
☆17 · Aug 4, 2014 · Updated 11 years ago
Paramathic / slim
SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs (ICML 2025)
☆36 · Nov 28, 2025 · Updated 7 months ago
wangqiang9 / Awesome-Diffusion-MoE
Awesome MoE Diffusion Models
☆21 · Mar 25, 2026 · Updated 3 months ago
abecirovic3 / MIC-1-Simulator
Simulator for the MIC-1 CPU described in Andrew S. Tanenbaum’s textbook Structured Computer Organization
☆14 · Jun 21, 2022 · Updated 4 years ago
chyyran / notes
Notes for my time at the University of Toronto
☆35 · Apr 18, 2023 · Updated 3 years ago
emstoudenmire / parallelDMRG
Real-space parallel density matrix renormalization group (DMRG) based on ITensor
☆20 · Dec 22, 2020 · Updated 5 years ago
meatcar / augmented-reality
Draw in the world around you with OpenGL and OpenCV
☆13 · May 7, 2014 · Updated 12 years ago
TanDongXu / CUDA-MCDNN
☆12 · Jul 13, 2017 · Updated 9 years ago