muriloboratto / NCCL

Examples of how to call collective operation functions in multi-GPU environments, including simple uses of the broadcast, reduce, allGather, reduceScatter and sendRecv operations.
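
As a rough illustration of what the five operations named above compute, here is a minimal pure-Python sketch that models each rank's buffer as a list. It does not call NCCL or touch a GPU, and it is not taken from the repository; the function names (`broadcast`, `reduce_to_root`, `all_gather`, `reduce_scatter`, `send_recv`) and the choice of sum as the reduction are assumptions made for the example.

```python
# Pure-Python model of collective semantics. Each "rank" is one inner list.
# Assumption: the reduction operator is an element-wise sum.

def broadcast(bufs, root):
    """Every rank ends up with a copy of the root rank's buffer."""
    return [list(bufs[root]) for _ in bufs]

def reduce_to_root(bufs, root):
    """The element-wise sum lands on root; other ranks keep their own data."""
    total = [sum(col) for col in zip(*bufs)]
    return [total if r == root else list(b) for r, b in enumerate(bufs)]

def all_gather(bufs):
    """Every rank ends up with all buffers concatenated in rank order."""
    gathered = [x for b in bufs for x in b]
    return [list(gathered) for _ in bufs]

def reduce_scatter(bufs):
    """The element-wise sum is cut into equal chunks; rank r keeps chunk r."""
    n = len(bufs)
    total = [sum(col) for col in zip(*bufs)]
    chunk = len(total) // n
    return [total[r * chunk:(r + 1) * chunk] for r in range(n)]

def send_recv(bufs, src, dst):
    """Point-to-point: dst's buffer is replaced by a copy of src's buffer."""
    out = [list(b) for b in bufs]
    out[dst] = list(bufs[src])
    return out

# Two simulated ranks holding two elements each.
ranks = [[1, 2], [10, 20]]
print(broadcast(ranks, 0))
print(reduce_to_root(ranks, 1))
print(all_gather(ranks))
print(reduce_scatter(ranks))
print(send_recv(ranks, 0, 1))
```

The real library works on device buffers and CUDA streams, so this sketch only conveys the data movement pattern of each operation, not how the repository implements it.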

☆34

Alternatives and similar repositories for NCCL

Users who are interested in NCCL are comparing it to the libraries listed below.

wzsh / wmma_tensorcore_sample
Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)
☆138 Updated 4 years ago
CalebDu / Awesome-Cute
☆87 Updated 2 months ago
ColfaxResearch / cfx-article-src
☆124 Updated 2 months ago
wmmae / wmma_extension
An extension library of WMMA API (Tensor Core API)
☆99 Updated last year
FZJ-JSC / tutorial-multi-gpu
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
☆280 Updated last month
xlite-dev / HGEMM
⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA and CuTe APIs, achieving peak ⚡️ performance.
☆87 Updated 2 months ago
gty111 / GEMM_MMA
Optimize GEMM with tensorcore step by step
☆29 Updated last year
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆184 Updated 5 months ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆148 Updated last month
leimao / CUDA-GEMM-Optimization
CUDA Matrix Multiplication Optimization
☆202 Updated last year
sunlex0717 / DissectingTensorCores
☆104 Updated last year
NVIDIA / nsight-training
Training material for Nsight developer tools
☆161 Updated 11 months ago
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆206 Updated 2 weeks ago
reed-lau / cute-gemm
☆125 Updated 7 months ago
Bruce-Lee-LY / cuda_hgemv
Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.
☆63 Updated 10 months ago
Oneflow-Inc / dfccl
☆26 Updated 5 months ago
muriloboratto / NVSHEMEM
Sample Codes using NVSHMEM on Multi-GPU
☆15 Updated 2 years ago
yifuwang / symm-mem-recipes
☆100 Updated 6 months ago
ROCm / rocSHMEM
rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
☆92 Updated this week
AyakaGEMM / Hands-on-GEMM
☆137 Updated last year
c3sr / tcu_scope
☆51 Updated 6 years ago
sjfeng1999 / gpu-arch-microbenchmark
Dissecting NVIDIA GPU Architecture
☆101 Updated 3 years ago
codyjrivera / tsm2x-imp
Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA
☆33 Updated 4 years ago
1duo / nccl-examples
NCCL Examples from Official NVIDIA NCCL Developer Guide.
☆17 Updated 7 years ago
OpenPPL / ppl.llm.kernel.cuda
☆149 Updated 6 months ago
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆111 Updated 10 months ago
nicolaswilde / cuda-tensorcore-hgemm
☆148 Updated 6 months ago
leimao / CUTLASS-Examples
CUTLASS and CuTe Examples
☆63 Updated this week
NVIDIA / online-softmax
Benchmark code for the "Online normalizer calculation for softmax" paper
☆95 Updated 6 years ago
HPMLL / NVIDIA-Hopper-Benchmark
☆50 Updated last month