YinLiu-91 / ncclOperationPlus

Uses ncclSend and ncclRecv to implement ncclSendrecv, ncclGather, ncclScatter, and ncclAlltoall

☆8

Alternatives and similar repositories for ncclOperationPlus

Users who are interested in ncclOperationPlus compare it to the repositories listed below.

Mellanox / gpu_direct_rdma_access
example code for using DC QP for providing RDMA READ and WRITE operations to remote GPU memory
☆137 Updated 11 months ago
BG2BKK / my_benchmark
benchmark for linux server
☆13 Updated 8 years ago
StarryVae / RDMA-tutorial
☆190 Updated 2 years ago
mellanox-hpc / libibprof
verbs profiling library
☆22 Updated last year
muriloboratto / NCCL
Sample examples of how to call collective operation functions on multi-GPU environments. A simple example of using broadcast, reduce, all…
☆34 Updated last year
feifeibear / swGEMM
A highly efficient library for GEMM operations on Sunway TaihuLight
☆18 Updated 4 years ago
FZJ-JSC / tutorial-multi-gpu
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
☆282 Updated last month
csl-iisc / GPM-ASPLOS22
☆36 Updated last year
c3sr / comm_scope
NUMA-aware multi-CPU multi-GPU data transfer benchmarks
☆23 Updated last year
poojahira / spmv-cuda
Implementation and analysis of five different GPU based SPMV algorithms in CUDA
☆41 Updated 6 years ago
1duo / nccl-examples
NCCL Examples from Official NVIDIA NCCL Developer Guide.
☆17 Updated 7 years ago
cyanguwa / nersc-roofline
☆45 Updated 4 years ago
gpudirect / libgdsync
GPUDirect Async support for IB Verbs
☆127 Updated 2 years ago
eniac / paella
Paella: Low-latency Model Serving with Virtualized GPU Scheduling
☆60 Updated last year
NVIDIA / df-nvshmem-prototype
Prototype of OpenSHMEM for NVIDIA GPUs, developed as part of DoE Design Forward
☆25 Updated 7 years ago
eth-cscs / Tiled-MM
Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.
☆33 Updated 3 months ago
rkhan055 / SHADE
SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training
☆35 Updated 2 years ago
merthidayetoglu / HiCCL
A hierarchical collective communications library with portable optimizations
☆35 Updated 7 months ago
Oneflow-Inc / dfccl
☆26 Updated 5 months ago
AI-HPC-Research-Team / AIPerf
Automated machine learning as an AI-HPC benchmark
☆66 Updated 3 years ago
astra-sim / tacos
TACOS: [T]opology-[A]ware [Co]llective Algorithm [S]ynthesizer for Distributed Machine Learning
☆25 Updated last month
uuudown / Tartan
Tartan: Evaluating Modern GPU Interconnect via a Multi-GPU Benchmark Suite
☆66 Updated 6 years ago
wzsh / wmma_tensorcore_sample
Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)
☆138 Updated 4 years ago
HaifengSun-Kira / RDMA-Tutorial
☆36 Updated 3 years ago
XpuOS / xsched
A preemptive scheduling framework for diverse XPUs, including GPUs, NPUs, ASICs, and FPGAs
☆69 Updated this week
lipracer / cuda-rt-hook
☆38 Updated last week
openucx / ucc
Unified Collective Communication Library
☆261 Updated last week
merthidayetoglu / CommBench
A Micro-benchmarking Tool for HPC Networks
☆31 Updated 3 weeks ago
AlphaSparse / Library
A sparse BLAS lib supporting multiple backends
☆44 Updated 5 months ago
casys-kaist / HUVM
☆24 Updated 2 years ago