google / nccl-fastsocketLinks

NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.

☆121

Alternatives and similar repositories for nccl-fastsocket

Users that are interested in nccl-fastsocket are comparing it to the libraries listed below

Sorting:

Mellanox / nccl-rdma-sharp-plugins
RDMA and SHARP plugins for nccl library
☆209Updated last month
microsoft / NPKit
NCCL Profiling Kit
☆145Updated last year
microsoft / msccl
Microsoft Collective Communication Library
☆367Updated 2 years ago
SymbioticLab / Salus
Fine-grained GPU sharing primitives
☆144Updated 2 months ago
facebookresearch / param
PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for…
☆153Updated last week
microsoft / msccl-tools
Synthesizer for optimal collective communication algorithms
☆118Updated last year
Mellanox / nv_peer_memory
☆376Updated last year
NVIDIA / nvidia-resiliency-ext
NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the …
☆226Updated this week
alibaba / GPU-scheduler-for-deep-learning
GPU-scheduler-for-deep-learning
☆210Updated 4 years ago
Azure / msccl
Microsoft Collective Communication Library
☆66Updated 10 months ago
microsoft / mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
☆425Updated this week
Azure / msccl-executor-nccl
☆46Updated 10 months ago
aws / aws-ofi-nccl
This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications.
☆186Updated this week
eniac / paella
Paella: Low-latency Model Serving with Virtualized GPU Scheduling
☆62Updated last year
AlibabaPAI / DAPPLE
An Efficient Pipelined Data Parallel Approach for Training Large Model
☆76Updated 4 years ago
netx-repo / PipeSwitch
PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications
☆126Updated 3 years ago
NVIDIA-Merlin / HierarchicalKV
HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of…
☆173Updated 3 weeks ago
uwsampl / nexus
☆83Updated 4 months ago
coreweave / nccl-tests
NVIDIA NCCL Tests for Distributed Training
☆114Updated last week
openucx / torch-ucc
pytorch ucc plugin
☆23Updated 4 years ago
ai-dynamo / nixl
NVIDIA Inference Xfer Library (NIXL)
☆673Updated this week
ai-dynamo / aiconfigurator
Offline optimization of your disaggregated Dynamo graph
☆79Updated this week
uccl-project / uccl
Ultra and Unified CCL
☆595Updated this week
SymbioticLab / Tiresias
Tiresias is a GPU cluster manager for distributed deep learning training.
☆162Updated 5 years ago
parasailteam / coconet
☆83Updated 2 years ago
NVIDIA / cloudai
CloudAI Benchmark Framework
☆71Updated last week
stanford-futuredata / gavel
Code for "Heterogenity-Aware Cluster Scheduling Policies for Deep Learning Workloads", which appeared at OSDI 2020
☆130Updated last year
microsoft / superbenchmark
A validation and profiling tool for AI infrastructure
☆341Updated this week
mcrl / tccl
Thunder Research Group's Collective Communication Library
☆42Updated 3 months ago
sakjain92 / Fractional-GPUs
Splits single Nvidia GPU into multiple partitions with complete compute and memory isolation (wrt to performace) between the partitions
☆159Updated 6 years ago