Oneflow-Inc / dfccl

☆27

Alternatives and similar repositories for dfccl

Users who are interested in dfccl are comparing it to the libraries listed below.

Infrawaves / DeepEP_ibrc_dual-ports_multiQP
Aims to implement dual-port and multi-qp solutions in deepEP ibrc transport
☆57 Updated 2 months ago
parasailteam / coconet
☆80 Updated 2 years ago
HPMLL / NVIDIA-Hopper-Benchmark
☆50 Updated 2 months ago
mcrl / tccl
Thunder Research Group's Collective Communication Library
☆39 Updated 3 weeks ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆154 Updated last month
eniac / paella
Paella: Low-latency Model Serving with Virtualized GPU Scheduling
☆60 Updated last year
Lin-Mao / DrGPUM
A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications.
☆25 Updated 9 months ago
muriloboratto / NVSHEMEM
Sample Codes using NVSHMEM on Multi-GPU
☆23 Updated 2 years ago
SJTU-IPADS / reef
REEF is a GPU-accelerated DNN inference serving system that enables instant kernel preemption and biased concurrent execution in GPU sche…
☆96 Updated 2 years ago
quiver-team / quiver-feature
High-performance RDMA-based distributed feature collection component for training GNN models on EXTREMELY large graphs
☆54 Updated 3 years ago
Azure / msccl
Microsoft Collective Communication Library
☆63 Updated 8 months ago
Azure / msccl-executor-nccl
☆37 Updated 7 months ago
microsoft / NPKit
NCCL Profiling Kit
☆139 Updated last year
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆183 Updated 6 months ago
infinigence / Semi-PD
A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.
☆103 Updated 2 months ago
ByteDance-Seed / StragglerAnalysis
☆38 Updated 3 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆91 Updated last month
shenh10 / DeepSeek_Simulator
☆83 Updated 4 months ago
TiledTensor / TiledLower
TiledLower is a Dataflow Analysis and Codegen Framework written in Rust.
☆14 Updated 8 months ago
tile-ai / tilescale
Tile-based language built for AI computation across all scales
☆30 Updated last week
google / iopddl
Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning
☆23 Updated 2 months ago
sjtu-epcc / Tacker
Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS
☆31 Updated 5 months ago
S-Lab-System-Group / Awesome-ML-for-System
SOTA Learning-augmented Systems
☆36 Updated 3 years ago
SJTU-IPADS / reef-artifacts
A GPU-accelerated DNN inference serving system that supports instant kernel preemption and biased concurrent execution in GPU scheduling.
☆42 Updated 3 years ago
antgroup / DeepXTrace
DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments.
☆38 Updated this week
alibaba / llm-scheduling-artifact
Artifact of OSDI '24 paper, “Llumnix: Dynamic Scheduling for Large Language Model Serving”
☆62 Updated last year
sunlex0717 / DissectingTensorCores
☆106 Updated last year
rchardx / cuda-gemm
☆25 Updated 4 months ago
yuyangJin / PerFlow-AI
PerFlow-AI is a programmable performance analysis, modeling, and prediction tool for AI systems.
☆21 Updated 3 months ago
abcdabcd987 / libfabric-efa-demo
☆48 Updated 6 months ago