NVIDIA/multi-gpu-programming-models

Examples demonstrating available options to program multiple GPUs in a single node or a cluster

☆908

Alternatives and similar repositories for multi-gpu-programming-models

Users interested in multi-gpu-programming-models are comparing it to the repositories listed below.

FZJ-JSC / tutorial-multi-gpu
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
☆380 · Jun 26, 2026 · Updated 3 weeks ago
NVIDIA / gdrcopy
A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
☆1,399 · Updated this week
NVIDIA / nvbench
CUDA Kernel Benchmarking Library
☆900 · Updated this week
NVIDIA / nccl
Optimized primitives for collective multi-GPU communication
☆4,892 · Updated this week
RRZE-HPC / gpu-benches
collection of benchmarks to measure basic GPU capabilities
☆530 · Oct 24, 2025 · Updated 8 months ago
NVIDIA / jitify
A single-header C++ library for simplifying the use of CUDA Runtime Compilation (NVRTC).
☆573 · Sep 15, 2025 · Updated 10 months ago
bytedance / flux
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
☆1,343 · Aug 28, 2025 · Updated 10 months ago
NVIDIA / nccl-tests
NCCL Tests
☆1,595 · Jul 9, 2026 · Updated last week
NVIDIA / nvshmem
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com…
☆560 · Updated this week
NVIDIA / nvbandwidth
A tool for bandwidth measurements on NVIDIA GPUs.
☆732 · Apr 8, 2026 · Updated 3 months ago
NVIDIA / cuCollections
☆654 · Updated this week
NVIDIA / cub
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
☆1,840 · Oct 9, 2023 · Updated 2 years ago
KuangjuX / NVSHMEM-Tutorial
NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer
☆195 · Feb 11, 2026 · Updated 5 months ago
openucx / ucc
Unified Collective Communication Library
☆310 · Updated this week
NVIDIA / CUDALibrarySamples
CUDA Library Samples
☆2,463 · Updated this week
NVIDIA / cutlass
CUDA Templates and Python DSLs for High-Performance Linear Algebra
☆10,104 · Updated this week
NVIDIA-developer-blog / code-samples
Source code examples from the Parallel Forall Blog
☆1,332 · Sep 23, 2025 · Updated 9 months ago
ByteDance-Seed / Triton-distributed
Distributed Compiler based on Triton for Parallel Systems
☆1,493 · Jul 11, 2026 · Updated last week
microsoft / mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
☆541 · Updated this week
NVIDIA / cccl
CUDA Core Compute Libraries
☆2,431 · Updated this week
NVIDIA / cuda-samples
Samples for CUDA Developers which demonstrates features in CUDA Toolkit
☆9,404 · May 27, 2026 · Updated last month
openucx / ucx
Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
☆1,673 · Updated this week
meta-pytorch / torchcomms
torchcomms: a modern PyTorch communications API
☆377 · Updated this week
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆445 · Mar 5, 2026 · Updated 4 months ago
microsoft / NPKit
NCCL Profiling Kit
☆155 · Jul 1, 2024 · Updated 2 years ago
eyalroz / cuda-api-wrappers
Thin, unified, C++-flavored wrappers for the CUDA APIs
☆900 · Updated this week
UoB-HPC / BabelStream
STREAM, for lots of devices written in many programming models
☆368 · Jun 15, 2026 · Updated last month
microsoft / msccl
Microsoft Collective Communication Library
☆394 · Sep 20, 2023 · Updated 2 years ago
yifuwang / symm-mem-recipes
☆170 · Dec 27, 2024 · Updated last year
Mellanox / nccl-rdma-sharp-plugins
RDMA and SHARP plugins for nccl library
☆233 · Apr 3, 2026 · Updated 3 months ago
Jokeren / Awesome-GPU
Awesome resources for GPUs
☆634 · Mar 10, 2026 · Updated 4 months ago
perplexityai / pplx-kernels
Perplexity GPU Kernels
☆590 · Nov 7, 2025 · Updated 8 months ago
Mellanox / nv_peer_memory
☆399 · Apr 23, 2024 · Updated 2 years ago
NVIDIA / NVTX
The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resou…
☆544 · Updated this week
ColfaxResearch / cutlass-kernels
☆269 · Jul 11, 2024 · Updated 2 years ago
gpudirect / libgdsync
GPUDirect Async support for IB Verbs
☆139 · Nov 10, 2022 · Updated 3 years ago
ai-dynamo / nixl
NVIDIA Inference Xfer Library (NIXL)
☆1,138 · Updated this week
microsoft / nnfusion
A flexible and efficient deep neural network (DNN) compiler that generates high-performance executable from a DNN model description.
☆1,002 · Sep 19, 2024 · Updated last year
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆242 · Jan 20, 2026 · Updated 5 months ago