Distributed MoE in a Single Kernel [NeurIPS '25]
☆194 · Feb 27, 2026 · Updated this week
Alternatives and similar repositories for FlashMoE
Users who are interested in FlashMoE are comparing it to the libraries listed below.
- Expert Specialization MoE Solution based on CUTLASS · ☆27 · Jan 19, 2026 · Updated last month
- ☆160 · Dec 27, 2024 · Updated last year
- ☆226 · Nov 19, 2025 · Updated 3 months ago
- Distributed Compiler based on Triton for Parallel Systems · ☆1,371 · Feb 13, 2026 · Updated 2 weeks ago
- Awesome Triton Resources · ☆39 · Apr 27, 2025 · Updated 10 months ago
- ☆87 · Jan 22, 2026 · Updated last month
- Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding · ☆93 · Dec 2, 2025 · Updated 3 months ago
- Efficient GPU communication over multiple NICs. · ☆24 · Nov 20, 2025 · Updated 3 months ago
- [NeurIPS 2025] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive · ☆66 · Dec 11, 2025 · Updated 2 months ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications · ☆475 · Updated this week
- CUTLASS and CuTe Examples · ☆132 · Nov 30, 2025 · Updated 3 months ago
- Efficient Long-context Language Model Training by Core Attention Disaggregation · ☆91 · Feb 23, 2026 · Updated last week
- Make triton easier · ☆50 · Jun 12, 2024 · Updated last year
- PyTorch routines for (Ker)nel (Mac)hines · ☆10 · Oct 10, 2025 · Updated 4 months ago
- ☆11 · Nov 14, 2023 · Updated 2 years ago
- An NCCL extension library, designed to efficiently offload GPU memory allocated by the NCCL communication library. · ☆98 · Dec 17, 2025 · Updated 2 months ago
- Persistent dense gemm for Hopper in `CuTeDSL` · ☆15 · Aug 9, 2025 · Updated 6 months ago
- BigBang-Proton is an LLM pretrained on cross-scale, cross-structure, cross-discipline real-world scientific tasks to construct a scienti… · ☆22 · Nov 8, 2025 · Updated 3 months ago
- An experimental communicating attention kernel based on DeepEP. · ☆35 · Jul 29, 2025 · Updated 7 months ago
- ☆178 · May 7, 2025 · Updated 9 months ago
- Next-generation datacenter OS built on kernel bypass to speed up unmodified code while improving platform density and security · ☆120 · Feb 21, 2026 · Updated last week
- https://bbuf.github.io/gpu-glossary-zh/ · ☆26 · Nov 7, 2025 · Updated 3 months ago
- RPCNIC: A High-Performance and Reconfigurable PCIe-attached RPC Accelerator [HPCA2025] · ☆13 · Dec 9, 2024 · Updated last year
- ☆18 · Nov 11, 2025 · Updated 3 months ago
- ☆347 · Jan 28, 2026 · Updated last month
- An official implementation of Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards · ☆37 · Oct 3, 2025 · Updated 5 months ago
- EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs · ☆46 · Sep 19, 2025 · Updated 5 months ago
- VehicleWorld is the first comprehensive multi-device environment for intelligent vehicle interaction that accurately models the complex, … · ☆21 · Sep 16, 2025 · Updated 5 months ago
- Artifact for IPDPS'21: DSXplore: Optimizing Convolutional Neural Networks via Sliding-Channel Convolutions. · ☆13 · Apr 6, 2021 · Updated 4 years ago
- [KernelGYM & Dr. Kernel] A distributed GPU environment and a collection of RL training methods to support RL for Kernel Generations · ☆90 · Feb 6, 2026 · Updated 3 weeks ago
- NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer · ☆163 · Feb 11, 2026 · Updated 3 weeks ago
- Perplexity open source garden for inference technology · ☆371 · Dec 25, 2025 · Updated 2 months ago
- [Archived] For the latest updates and community contribution, please visit: https://github.com/Ascend/TransferQueue or https://gitcode.co… · ☆13 · Jan 16, 2026 · Updated last month
- ☆12 · May 13, 2025 · Updated 9 months ago
- My attempt to improve the speed of the Newton-Schulz algorithm, starting from the dion implementation. · ☆32 · Dec 5, 2025 · Updated 2 months ago
- A Triton-only attention backend for vLLM · ☆24 · Feb 11, 2026 · Updated 3 weeks ago
- High-performance RMSNorm implementation using SM core storage (registers and shared memory) · ☆30 · Jan 22, 2026 · Updated last month
- ☆134 · May 29, 2025 · Updated 9 months ago
- Perplexity GPU Kernels · ☆567 · Nov 7, 2025 · Updated 3 months ago