Infrawaves / DeepEP_ibrc_dual-ports_multiQP
Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport.
☆63 Updated 5 months ago
Alternatives and similar repositories for DeepEP_ibrc_dual-ports_multiQP
Users who are interested in DeepEP_ibrc_dual-ports_multiQP are comparing it to the libraries listed below.
- DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit ☆67 Updated this week
- NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer ☆135 Updated 3 weeks ago
- A lightweight design for computation-communication overlap. ☆179 Updated 3 weeks ago
- ☆46 Updated 10 months ago
- ☆65 Updated 5 months ago
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆111 Updated 4 months ago
- ☆86 Updated 6 months ago
- ☆27 Updated 7 months ago
- DeeperGEMM: crazy optimized version ☆72 Updated 5 months ago
- Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend ☆76 Updated 2 weeks ago
- ☆57 Updated 4 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆97 Updated 3 months ago
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance. ☆120 Updated 5 months ago
- DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments. ☆60 Updated this week
- Efficient Compute-Communication Overlap for Distributed LLM Inference ☆58 Updated last week
- ☆72 Updated last year
- Microsoft Collective Communication Library ☆66 Updated 10 months ago
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆63 Updated 3 weeks ago
- Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond ☆99 Updated last week
- Artifact of OSDI '24 paper, “Llumnix: Dynamic Scheduling for Large Language Model Serving” ☆62 Updated last year
- ☆60 Updated 9 months ago
- ☆42 Updated 5 months ago
- High performance Transformer implementation in C++. ☆135 Updated 8 months ago
- A practical way of learning Swizzle ☆28 Updated 8 months ago
- Thunder Research Group's Collective Communication Library ☆42 Updated 3 months ago
- TiledLower is a Dataflow Analysis and Codegen Framework written in Rust. ☆14 Updated 10 months ago
- ☆28 Updated 6 months ago
- Tile-based language built for AI computation across all scales ☆66 Updated 2 weeks ago
- A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache. ☆60 Updated this week
- Debug print operator for cudagraph debugging ☆14 Updated last year