osayamenja / FlashMoE

Distributed MoE in a Single Kernel [NeurIPS '25]

☆145

Alternatives and similar repositories for FlashMoE

Users who are interested in FlashMoE are comparing it to the libraries listed below.

ByteDance-Seed / cudaLLM
☆125 Updated 3 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆131 Updated 6 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆73 Updated 6 months ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆108 Updated 8 months ago
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆144 Updated 2 months ago
flashinfer-ai / cutlass-viz
☆65 Updated 7 months ago
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆155 Updated last month
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆188 Updated last month
mit-han-lab / Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆353 Updated 4 months ago
ruipeterpan / marconi
Artifact for "Marconi: Prefix Caching for the Era of Hybrid LLMs" [MLSys '25 Outstanding Paper Award, Honorable Mention]
☆46 Updated 8 months ago
luliyucoordinate / cute-flash-attention
Implement Flash Attention using CuTe.
☆97 Updated 11 months ago
Victarry / PP-Schedule-Visualization
Pipeline Parallelism Emulation and Visualization
☆72 Updated 5 months ago
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆143 Updated 3 weeks ago
mit-han-lab / flash-moba
☆187 Updated last week
microsoft / nnscaler
nnScaler: Compiling DNN models for Parallel Training
☆120 Updated 2 months ago
tile-ai / AttentionEngine
☆51 Updated 6 months ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆44 Updated last year
thunlp / Seq1F1B
Sequence-level 1F1B schedule for LLMs.
☆37 Updated 3 months ago
microsoft / AttentionEngine
☆113 Updated 6 months ago
Infini-AI-Lab / vortex_torch
Vortex: A Flexible and Efficient Sparse Attention Framework
☆33 Updated this week
HanGuo97 / hilt
☆38 Updated this week
microsoft / SparTA
☆159 Updated last year
stepfun-ai / StepMesh
☆324 Updated 2 weeks ago
DD-DuDa / BitDecoding
[HPCA 2026] A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.
☆63 Updated last week
zhuzilin / flash-attention-with-sink
☆39 Updated 3 months ago
thu-ml / Jetfire-INT8Training
☆60 Updated last year
nex-agi / NexVenusCL
Nex Venus Communication Library
☆59 Updated 2 weeks ago
fzyzcjy / torch_utils
Utility scripts for PyTorch (e.g. Make Perfetto show some disappearing kernels, Memory profiler that understands more low-level allocatio…
☆72 Updated 2 months ago
flashinfer-ai / flashinfer-bench
Building the Virtuous Cycle for AI-driven LLM Systems
☆92 Updated 2 weeks ago
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆59 Updated 8 months ago