ademeure/DeeperGEMM

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/ademeure/DeeperGEMM)

ademeure / DeeperGEMM

DeeperGEMM: crazy optimized version

☆86

Alternatives and similar repositories for DeeperGEMM

Users that are interested in DeeperGEMM are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

zhuzilin / flash-attention-with-sink
View on GitHub
☆37Aug 7, 2025Updated 11 months ago
KuangjuX / AttnLink
View on GitHub
An experimental communicating attention kernel based on DeepEP.
☆34Jul 29, 2025Updated 11 months ago
flashinfer-ai / cutlass-viz
View on GitHub
☆65Apr 26, 2025Updated last year
KuangjuX / NVSHMEM-Tutorial
View on GitHub
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆195Feb 11, 2026Updated 5 months ago
TiledTensor / TiledBench
View on GitHub
Benchmark tests supporting the TiledCUDA library.
☆19Nov 19, 2024Updated last year
1-Click AI Models by DigitalOcean Gradient • Ad
Deploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click. Zero configuration with optimized deployments.
aikitoria / nanotrace
View on GitHub
Low overhead tracing library and trace visualizer for pipelined CUDA kernels
☆137Updated this week
ademeure / cuda-side-boost
View on GitHub
☆60Feb 24, 2026Updated 4 months ago
tile-ai / AttentionEngine
View on GitHub
☆52May 19, 2025Updated last year
leepoly / sm-profiler
View on GitHub
☆82Feb 5, 2026Updated 5 months ago
cherichy / tilecute
View on GitHub
☆32Jul 2, 2025Updated last year
flashinfer-ai / cubloaty
View on GitHub
a size profiler for cuda binary
☆71Jan 15, 2026Updated 6 months ago
Dao-AILab / sonic-moe
View on GitHub
Accelerating MoE with IO and Tile-aware Optimizations
☆732Jul 4, 2026Updated 2 weeks ago
dsl-learn / cutile-learn
View on GitHub
NVIDIA cuTile learn
☆169Dec 9, 2025Updated 7 months ago
nex-agi / NexVenusCL
View on GitHub
Nex Venus Communication Library
☆75Nov 17, 2025Updated 8 months ago
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
antgroup / DeepXTrace
View on GitHub
DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments.
☆100Jan 16, 2026Updated 6 months ago
li-plus / flash-preference
View on GitHub
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆52Jul 4, 2025Updated last year
microsoft / TileFusion
View on GitHub
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆115Jun 28, 2025Updated last year
inclusionAI / humming
View on GitHub
☆160Updated this week
ByteDance-Seed / Triton-distributed
View on GitHub
Distributed Compiler based on Triton for Parallel Systems
☆1,495Updated this week
Triang-jyed-driung / i8muon
View on GitHub
Muon in Int8 Precision Made Possible
☆20Jun 18, 2026Updated last month
infinigence / FlashOverlap
View on GitHub
A lightweight design for computation-communication overlap.
☆242Jan 20, 2026Updated 6 months ago
HydraQYH / hp_rms_norm
View on GitHub
High performance RMSNorm Implement by using SM Core Storage(Registers and Shared Memory)
☆30Jan 22, 2026Updated 5 months ago
GeeeekExplorer / kkbot
View on GitHub
A Feishu/Lark AI agent bot
☆15Feb 27, 2026Updated 4 months ago
Proton VPN Special Offer - Get 70% off • Ad
Special partner offer. Trusted by over 100 million users worldwide. Tested, Approved and Recommended by Experts.
perplexityai / pplx-kernels
View on GitHub
Perplexity GPU Kernels
☆591Nov 7, 2025Updated 8 months ago
flashinfer-ai / debug-print
View on GitHub
Debug print operator for cudagraph debugging
☆18Aug 2, 2024Updated last year
luliyucoordinate / cute-flash-attention
View on GitHub
Implement Flash Attention using Cute.
☆108Dec 17, 2024Updated last year
lemyx / tilelang-dsa
View on GitHub
DeepSeek-V3.2-Exp DSA Warmup Lightning Indexer training operator based on tilelang
☆47Nov 19, 2025Updated 8 months ago
xlite-dev / ffpa-attn
View on GitHub
🤖FFPA: Extends FA-2/3 via Split-D for large headdims, 1.5x~6×↑🎉 vs SDPA, up to 513~535 TFLOPS🎉 on NVIDIA H200.
☆315Jul 15, 2026Updated last week
microsoft / AttentionEngine
View on GitHub
☆123May 19, 2025Updated last year
YJMSTR / flash-linear-attention
View on GitHub
FLA but cuTile
☆27Apr 17, 2026Updated 3 months ago
catswe / flash-attention-residuals
View on GitHub
Triton kernels and PyTorch ops for Block Attention Residuals (AttnRes)
☆86May 29, 2026Updated last month
yzlnew / infra-skills
View on GitHub
A collection of specialized agent skills for AI infrastructure development, enabling Claude Code to write, optimize, and debug high-perfo…
☆139Jul 9, 2026Updated last week
Simple, predictable pricing with DigitalOcean hosting • Ad
Always know what you'll pay with monthly caps and flat pricing. Enterprise-grade infrastructure trusted by 600k+ customers.
inclusionAI / asystem-amem
View on GitHub
A NCCL extension library, designed to efficiently offload GPU memory allocated by the NCCL communication library.
☆110Dec 17, 2025Updated 7 months ago
xinhao-luo / ClusterFusion
View on GitHub
[NeurIPS 2025] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive
☆75Dec 11, 2025Updated 7 months ago
uccl-project / mKernel
View on GitHub
mKernel: fast multi-node, multi-GPU fused kernels
☆252Jun 21, 2026Updated last month
serdes21 / flashtile
View on GitHub
FlashTile is a CUDA Tile IR compiler that is compatible with NVIDIA's tileiras, targeting SM70 through SM121 NVIDIA GPUs.
☆61Feb 6, 2026Updated 5 months ago
MoonshotAI / FlashKDA
View on GitHub
FlashKDA: high-performance Kimi Delta Attention kernels
☆464May 26, 2026Updated last month
HPMLL / NVIDIA-Hopper-Benchmark
View on GitHub
☆115May 31, 2025Updated last year
ColfaxResearch / cutlass-kernels
View on GitHub
☆269Jul 11, 2024Updated 2 years ago