osayamenja / FlashMoE
Distributed MoE in a Single Kernel [NeurIPS '25]
☆145 · Updated 2 months ago
Alternatives and similar repositories for FlashMoE
Users interested in FlashMoE are comparing it to the libraries listed below.
- ☆125 · Updated 3 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆131 · Updated 6 months ago
- DeeperGEMM: crazy optimized version ☆73 · Updated 6 months ago
- 16-fold memory access reduction with nearly no loss ☆108 · Updated 8 months ago
- NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer ☆144 · Updated 2 months ago
- ☆65 · Updated 7 months ago
- Code for paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆155 · Updated last month
- A lightweight design for computation-communication overlap. ☆188 · Updated last month
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆353 · Updated 4 months ago
- Artifact for "Marconi: Prefix Caching for the Era of Hybrid LLMs" [MLSys '25 Outstanding Paper Award, Honorable Mention] ☆46 · Updated 8 months ago
- Implement Flash Attention using Cute. ☆97 · Updated 11 months ago
- Pipeline Parallelism Emulation and Visualization ☆72 · Updated 5 months ago
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning ☆143 · Updated 3 weeks ago
- ☆187 · Updated last week
- nnScaler: Compiling DNN models for Parallel Training ☆120 · Updated 2 months ago
- ☆51 · Updated 6 months ago
- Quantized Attention on GPU ☆44 · Updated last year
- Sequence-level 1F1B schedule for LLMs. ☆37 · Updated 3 months ago
- ☆113 · Updated 6 months ago
- Vortex: A Flexible and Efficient Sparse Attention Framework ☆33 · Updated this week
- ☆38 · Updated this week
- ☆159 · Updated last year
- ☆324 · Updated 2 weeks ago
- [HPCA 2026] A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache. ☆63 · Updated last week
- ☆39 · Updated 3 months ago
- ☆60 · Updated last year
- Nex Venus Communication Library ☆59 · Updated 2 weeks ago
- Utility scripts for PyTorch (e.g. Make Perfetto show some disappearing kernels, Memory profiler that understands more low-level allocatio… ☆72 · Updated 2 months ago
- Building the Virtuous Cycle for AI-driven LLM Systems ☆92 · Updated 2 weeks ago
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs ☆59 · Updated 8 months ago