Infrawaves / DeepEP_ibrc_dual-ports_multiQPLinks

Aims to implement dual-port and multi-qp solutions in deepEP ibrc transport

☆63

Alternatives and similar repositories for DeepEP_ibrc_dual-ports_multiQP

Users that are interested in DeepEP_ibrc_dual-ports_multiQP are comparing it to the libraries listed below

Sorting:

DeepLink-org / DLSlime
DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit
☆67Updated this week
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆135Updated 3 weeks ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆179Updated 3 weeks ago
Azure / msccl-executor-nccl
☆46Updated 10 months ago
flashinfer-ai / cutlass-viz
☆65Updated 5 months ago
infinigence / Semi-PD
A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.
☆111Updated 4 months ago
shenh10 / DeepSeek_Simulator
☆86Updated 6 months ago
Oneflow-Inc / dfccl
☆27Updated 7 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆72Updated 5 months ago
Ascend / triton-ascend
Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend
☆76Updated 2 weeks ago
HPMLL / NVIDIA-Hopper-Benchmark
☆57Updated 4 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆97Updated 3 months ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆120Updated 5 months ago
antgroup / DeepXTrace
DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments.
☆60Updated this week
microsoft / tokenweave
Efficient Compute-Communication Overlap for Distributed LLM Inference
☆58Updated last week
hao-ai-lab / MuxServe
☆72Updated last year
Azure / msccl
Microsoft Collective Communication Library
☆66Updated 10 months ago
flexflow / flexflow-serve
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
☆63Updated 3 weeks ago
ovg-project / kvcached
Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond
☆99Updated last week
alibaba / llm-scheduling-artifact
Artifact of OSDI '24 paper, ”Llumnix: Dynamic Scheduling for Large Language Model Serving“
☆62Updated last year
abcdabcd987 / libfabric-efa-demo
☆60Updated 9 months ago
ByteDance-Seed / StragglerAnalysis
☆42Updated 5 months ago
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆135Updated 8 months ago
Chtholly-Boss / swizzle
A practical way of learning Swizzle
☆28Updated 8 months ago
mcrl / tccl
Thunder Research Group's Collective Communication Library
☆42Updated 3 months ago
TiledTensor / TiledLower
TiledLower is a Dataflow Analysis and Codegen Framework written in Rust.
☆14Updated 10 months ago
rchardx / cuda-gemm
☆28Updated 6 months ago
tile-ai / tilescale
Tile-based language built for AI computation across all scales
☆66Updated 2 weeks ago
DD-DuDa / BitDecoding
A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.
☆60Updated this week
flashinfer-ai / debug-print
Debug print operator for cudagraph debugging
☆14Updated last year