fanshiqing/grouped_gemm

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/fanshiqing/grouped_gemm)

fanshiqing / grouped_gemm

PyTorch bindings for CUTLASS grouped GEMM.

☆185

Alternatives and similar repositories for grouped_gemm

Users that are interested in grouped_gemm are comparing it to the libraries listed below

Sorting:

tgale96 / grouped_gemm
View on GitHub
PyTorch bindings for CUTLASS grouped GEMM.
☆144May 29, 2025Updated 9 months ago
perplexityai / pplx-kernels
View on GitHub
Perplexity GPU Kernels
☆567Nov 7, 2025Updated 3 months ago
feifeibear / long-context-attention
View on GitHub
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference
☆644Jan 15, 2026Updated last month
ByteDance-Seed / Triton-distributed
View on GitHub
Distributed Compiler based on Triton for Parallel Systems
☆1,371Feb 13, 2026Updated 3 weeks ago
feifeibear / ChituAttention
View on GitHub
Quantized Attention on GPU
☆44Nov 22, 2024Updated last year
sail-sg / zero-bubble-pipeline-parallelism
View on GitHub
Zero Bubble Pipeline Parallelism
☆451May 7, 2025Updated 9 months ago
ademeure / DeeperGEMM
View on GitHub
DeeperGEMM: crazy optimized version
☆74May 5, 2025Updated 10 months ago
RulinShao / LightSeq
View on GitHub
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆222Aug 19, 2024Updated last year
bytedance / flux
View on GitHub
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
☆1,264Aug 28, 2025Updated 6 months ago
feifeibear / Odysseus-Transformer
View on GitHub
Odysseus: Playground of LLM Sequence Parallelism
☆79Jun 17, 2024Updated last year
NVIDIA / TransformerEngine
View on GitHub
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on H…
☆3,176Updated this week
flagos-ai / FlagGems
View on GitHub
FlagGems is an operator library for large language models implemented in the Triton Language.
☆909Updated this week
zhuzilin / ring-flash-attention
View on GitHub
Ring attention implementation with flash attention
☆987Sep 10, 2025Updated 5 months ago
rchardx / cuda-gemm
View on GitHub
☆42Nov 1, 2025Updated 4 months ago
ColfaxResearch / cutlass-kernels
View on GitHub
☆262Jul 11, 2024Updated last year
Bruce-Lee-LY / cutlass_gemm
View on GitHub
Multiple GEMM operators are constructed with cutlass to support LLM inference.
☆20Aug 3, 2025Updated 7 months ago
Dao-AILab / grouped-latent-attention
View on GitHub
☆136May 29, 2025Updated 9 months ago
Azure / MS-AMP
View on GitHub
Microsoft Automatic Mixed Precision Library
☆636Dec 1, 2025Updated 3 months ago
apuaaChen / EVT_AE
View on GitHub
Artifacts of EVT ASPLOS'24
☆29Mar 6, 2024Updated 2 years ago
DD-DuDa / Cute-Learning
View on GitHub
Examples of CUDA implementations by Cutlass CuTe
☆269Jul 1, 2025Updated 8 months ago
zhuzilin / flash-attention-with-sink
View on GitHub
☆38Aug 7, 2025Updated 6 months ago
weishengying / cutlass_flash_atten_fp8
View on GitHub
使用 cutlass 仓库在 ada 架构上实现 fp8 的 flash attention
☆79Aug 12, 2024Updated last year
Karbo123 / pytorch_grouped_gemm
View on GitHub
High Performance Grouped GEMM in PyTorch
☆31May 10, 2022Updated 3 years ago
shawntan / scattermoe
View on GitHub
Triton-based implementation of Sparse Mixture of Experts.
☆268Oct 3, 2025Updated 5 months ago
flashinfer-ai / cutlass-viz
View on GitHub
☆65Apr 26, 2025Updated 10 months ago
ColfaxResearch / cfx-article-src
View on GitHub
☆178May 7, 2025Updated 9 months ago
Dao-AILab / sonic-moe
View on GitHub
Accelerating MoE with IO and Tile-aware Optimizations
☆597Feb 27, 2026Updated last week
ISEEKYAN / mbridge
View on GitHub
Bridge Megatron-Core to Hugging Face/Reinforcement Learning
☆197Updated this week
IST-DASLab / Quartet
View on GitHub
☆120Jan 8, 2026Updated last month
yanring / Megatron-MoE-ModelZoo
View on GitHub
Best practices for training DeepSeek, Mixtral, Qwen and other MoE models using Megatron Core.
☆167Jan 22, 2026Updated last month
mayank31398 / ladder-residual-inference
View on GitHub
☆15Jul 13, 2025Updated 7 months ago
NTT123 / cute-viz
View on GitHub
Cute layout visualization
☆30Jan 18, 2026Updated last month
facebookresearch / HolisticTraceAnalysis
View on GitHub
A library to analyze PyTorch traces.
☆472Feb 4, 2026Updated last month
xlite-dev / HGEMM
View on GitHub
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆147May 10, 2025Updated 9 months ago
SandAI-org / MagiAttention
View on GitHub
A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training
☆650Feb 27, 2026Updated last week
Chtholly-Boss / swizzle
View on GitHub
A practical way of learning Swizzle
☆37Feb 3, 2025Updated last year
IBM / triton-dejavu
View on GitHub
Framework to reduce autotune overhead to zero for well known deployments.
☆97Sep 19, 2025Updated 5 months ago
IST-DASLab / qutlass
View on GitHub
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆168Nov 11, 2025Updated 3 months ago
microsoft / Tutel
View on GitHub
Tutel MoE: Optimized Mixture-of-Experts Library, Support GptOss/DeepSeek/Kimi-K2/Qwen3 using FP8/NVFP4/MXFP4
☆969Dec 21, 2025Updated 2 months ago