Accelerating MoE with IO and Tile-aware Optimizations
☆597 · Feb 27, 2026 · Updated last week
Alternatives and similar repositories for sonic-moe
Users interested in sonic-moe are comparing it to the libraries listed below.
- Distributed Compiler based on Triton for Parallel Systems ☆1,371 · Feb 13, 2026 · Updated 3 weeks ago
- Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels ☆5,284 · Feb 28, 2026 · Updated last week
- ☆118 · May 19, 2025 · Updated 9 months ago
- FlashInfer: Kernel Library for LLM Serving ☆5,057 · Updated this week
- 🚀 Efficient implementations of state-of-the-art linear attention models ☆4,474 · Updated this week
- Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding ☆93 · Dec 2, 2025 · Updated 3 months ago
- Tile primitives for speedy kernels ☆3,202 · Feb 24, 2026 · Updated last week
- A fast communication-overlapping library for tensor/expert parallelism on GPUs ☆1,264 · Aug 28, 2025 · Updated 6 months ago
- LM engine is a library for pretraining/finetuning LLMs ☆126 · Updated this week
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ☆969 · Feb 5, 2026 · Updated last month
- Perplexity GPU Kernels ☆567 · Nov 7, 2025 · Updated 4 months ago
- ☆129 · Jun 6, 2025 · Updated 9 months ago
- Fast and memory-efficient exact kmeans ☆140 · Feb 18, 2026 · Updated 2 weeks ago
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on H… ☆3,176 · Feb 28, 2026 · Updated last week
- ☆38 · Aug 7, 2025 · Updated 7 months ago
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens ☆1,025 · Sep 4, 2024 · Updated last year
- Nex Venus Communication Library ☆72 · Nov 17, 2025 · Updated 3 months ago
- PyTorch bindings for CUTLASS grouped GEMM ☆185 · Feb 19, 2026 · Updated 2 weeks ago
- A sparse attention kernel supporting mixed sparse patterns ☆472 · Jan 18, 2026 · Updated last month
- DeeperGEMM: crazy optimized version ☆74 · May 5, 2025 · Updated 10 months ago
- A Quirky Assortment of CuTe Kernels ☆838 · Updated this week
- NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer ☆165 · Feb 11, 2026 · Updated 3 weeks ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆817 · Mar 6, 2025 · Updated last year
- 🤖 FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA ☆255 · Feb 13, 2026 · Updated 3 weeks ago
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate ☆774 · Updated this week
- [ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-t… ☆3,192 · Jan 17, 2026 · Updated last month
- Analyze computation-communication overlap in V3/R1 ☆1,143 · Mar 21, 2025 · Updated 11 months ago
- Efficient Long-context Language Model Training by Core Attention Disaggregation ☆92 · Updated this week
- Incubator repo for the CUDA-TileIR backend ☆109 · Feb 14, 2026 · Updated 3 weeks ago
- ☆20 · Aug 20, 2025 · Updated 6 months ago
- Mirage Persistent Kernel: Compiling LLMs into a MegaKernel ☆2,148 · Feb 23, 2026 · Updated last week
- ☆88 · Updated this week
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com… ☆469 · Feb 28, 2026 · Updated last week
- Helpful kernel tutorials and examples for tile-based GPU programming ☆659 · Updated this week
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling ☆6,206 · Feb 27, 2026 · Updated last week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆527 · Feb 10, 2025 · Updated last year
- ☆226 · Nov 19, 2025 · Updated 3 months ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆374 · Jul 10, 2025 · Updated 7 months ago
- From Minimal GEMM to Everything ☆163 · Feb 10, 2026 · Updated 3 weeks ago