Infrawaves / DeepEP_ibrc_dual-ports_multiQP
Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport.
☆65 Updated 5 months ago
Alternatives and similar repositories for DeepEP_ibrc_dual-ports_multiQP
Users who are interested in DeepEP_ibrc_dual-ports_multiQP are comparing it to the libraries listed below.
- ☆46 Updated 10 months ago
- DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit ☆73 Updated last week
- NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer ☆142 Updated last month
- A lightweight design for computation-communication overlap. ☆182 Updated 3 weeks ago
- ☆65 Updated 6 months ago
- Efficient Compute-Communication Overlap for Distributed LLM Inference ☆61 Updated this week
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆114 Updated 5 months ago
- Microsoft Collective Communication Library ☆67 Updated 11 months ago
- DeeperGEMM: crazy optimized version ☆72 Updated 5 months ago
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆63 Updated last month
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆100 Updated 4 months ago
- ☆90 Updated 7 months ago
- ☆61 Updated 5 months ago
- ☆26 Updated 8 months ago
- ☆43 Updated 6 months ago
- ☆65 Updated 9 months ago
- Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend ☆79 Updated last month
- DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments. ☆65 Updated last week
- ☆74 Updated 2 weeks ago
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance. ☆124 Updated 5 months ago
- Thunder Research Group's Collective Communication Library ☆42 Updated 3 months ago
- High performance Transformer implementation in C++. ☆139 Updated 9 months ago
- ☆31 Updated 4 months ago
- ☆91 Updated last week
- Stateful LLM Serving ☆87 Updated 7 months ago
- Debug print operator for cudagraph debugging ☆14 Updated last year
- ☆19 Updated last year
- [DAC'25] Official implement of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference" ☆81 Updated 4 months ago
- Tile-based language built for AI computation across all scales ☆74 Updated last week
- A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache. ☆61 Updated 2 weeks ago