Accelerating MoE with IO and Tile-aware Optimizations
☆597 · Feb 27, 2026 · Updated last week
Alternatives and similar repositories for sonic-moe
Users interested in sonic-moe are comparing it to the libraries listed below.
- Distributed Compiler based on Triton for Parallel Systems ☆1,371 · Feb 13, 2026 · Updated 3 weeks ago
- Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels ☆5,284 · Feb 28, 2026 · Updated last week
- ☆118 · May 19, 2025 · Updated 9 months ago
- FlashInfer: Kernel Library for LLM Serving ☆5,057 · Updated this week
- 🚀 Efficient implementations of state-of-the-art linear attention models ☆4,474 · Updated this week
- Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding ☆93 · Dec 2, 2025 · Updated 3 months ago
- Tile primitives for speedy kernels ☆3,202 · Feb 24, 2026 · Updated last week
- A fast communication-overlapping library for tensor/expert parallelism on GPUs ☆1,264 · Aug 28, 2025 · Updated 6 months ago
- LM engine is a library for pretraining/finetuning LLMs ☆126 · Updated this week
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ☆969 · Feb 5, 2026 · Updated last month
- Perplexity GPU Kernels ☆567 · Nov 7, 2025 · Updated 4 months ago
- ☆129 · Jun 6, 2025 · Updated 9 months ago
- Fast and memory-efficient exact kmeans ☆140 · Feb 18, 2026 · Updated 2 weeks ago
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on H… ☆3,176 · Feb 28, 2026 · Updated last week
- ☆38 · Aug 7, 2025 · Updated 7 months ago
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens ☆1,025 · Sep 4, 2024 · Updated last year
- Nex Venus Communication Library ☆72 · Nov 17, 2025 · Updated 3 months ago
- PyTorch bindings for CUTLASS grouped GEMM ☆185 · Feb 19, 2026 · Updated 2 weeks ago
- A sparse attention kernel supporting mixed sparse patterns ☆472 · Jan 18, 2026 · Updated last month
- DeeperGEMM: crazy optimized version ☆74 · May 5, 2025 · Updated 10 months ago
- A Quirky Assortment of CuTe Kernels ☆838 · Updated this week
- NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer ☆165 · Feb 11, 2026 · Updated 3 weeks ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆817 · Mar 6, 2025 · Updated last year
- 🤖 FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA ☆255 · Feb 13, 2026 · Updated 3 weeks ago
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate ☆774 · Updated this week
- [ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-t… ☆3,192 · Jan 17, 2026 · Updated last month
- Analyze computation-communication overlap in V3/R1 ☆1,143 · Mar 21, 2025 · Updated 11 months ago
- Efficient Long-context Language Model Training by Core Attention Disaggregation ☆92 · Updated this week
- Incubator repo for the CUDA-TileIR backend ☆109 · Feb 14, 2026 · Updated 3 weeks ago
- ☆20 · Aug 20, 2025 · Updated 6 months ago
- Mirage Persistent Kernel: Compiling LLMs into a MegaKernel ☆2,148 · Feb 23, 2026 · Updated last week
- ☆88 · Updated this week
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com… ☆469 · Feb 28, 2026 · Updated last week
- Helpful kernel tutorials and examples for tile-based GPU programming ☆659 · Updated this week
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling ☆6,206 · Feb 27, 2026 · Updated last week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆527 · Feb 10, 2025 · Updated last year
- ☆226 · Nov 19, 2025 · Updated 3 months ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆374 · Jul 10, 2025 · Updated 7 months ago
- From Minimal GEMM to Everything ☆163 · Feb 10, 2026 · Updated 3 weeks ago