muriloboratto / NCCL
Examples of how to call NCCL collective operations in multi-GPU environments, covering the broadcast, reduce, allGather, reduceScatter, and sendRecv operations.
☆28 Updated last year
Alternatives and similar repositories for NCCL:
Users who are interested in NCCL are comparing it to the libraries listed below.
- ☆78 Updated 4 months ago
- ☆66 Updated 3 weeks ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆122 Updated 4 years ago
- Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core. ☆54 Updated 4 months ago
- Benchmark code for the "Online normalizer calculation for softmax" paper ☆62 Updated 6 years ago
- ☆46 Updated 5 years ago
- CUTLASS and CuTe Examples ☆35 Updated 2 weeks ago