NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

☆4,274

Alternatives and similar repositories for nccl

Users who are interested in nccl compare it to the libraries listed below.

NVIDIA / nccl-tests
NCCL Tests
☆1,343 Updated 2 weeks ago
pytorch / gloo
Collective communications library with various primitives for multi-machine training.
☆1,372 Updated this week
NVIDIA / gdrcopy
A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
☆1,285 Updated 3 months ago
NVIDIA / TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on H…
☆2,971 Updated this week
pytorch / FBGEMM
FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/
☆1,487 Updated last week
mlcommons / training
Reference implementations of MLPerf® training benchmarks
☆1,729 Updated last week
NVIDIA / cutlass
CUDA Templates and Python DSLs for High-Performance Linear Algebra
☆8,865 Updated last week
NVIDIA / cub
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
☆1,807 Updated 2 years ago
flashinfer-ai / flashinfer
FlashInfer: Kernel Library for LLM Serving
☆4,168 Updated this week
mlcommons / inference
Reference implementations of MLPerf® inference benchmarks
☆1,495 Updated last week
openucx / ucx
Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
☆1,514 Updated last week
NVIDIA / CUDALibrarySamples
CUDA Library Samples
☆2,219 Updated last week
NVIDIA-developer-blog / code-samples
Source code examples from the Parallel Forall Blog
☆1,313 Updated 2 months ago
NVIDIA / DALI
A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep lear…
☆5,567 Updated this week
NVIDIA / cccl
CUDA Core Compute Libraries
☆2,047 Updated this week
open-mpi / ompi
Open MPI main development repository
☆2,461 Updated last week
pytorch / benchmark
TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance.
☆998 Updated this week
pytorch / xla
Enabling PyTorch on XLA Devices (e.g. Google TPU)
☆2,716 Updated this week
dmlc / dlpack
common in-memory tensor structure
☆1,106 Updated last month
uxlfoundation / oneDNN
oneAPI Deep Neural Network Library (oneDNN)
☆3,929 Updated this week
NVIDIA / multi-gpu-programming-models
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
☆834 Updated 2 months ago
openxla / xla
A machine learning compiler for GPUs, CPUs, and ML accelerators
☆3,757 Updated this week
NVIDIA / FasterTransformer
Transformer related optimization, including BERT, GPT
☆6,355 Updated last year
tile-ai / tilelang
Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels
☆4,054 Updated this week
kvcache-ai / Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
☆4,357 Updated this week
pytorch / kineto
A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters.
☆897 Updated 2 weeks ago
flexflow / flexflow-train
Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training
☆1,846 Updated this week
facebookresearch / fairscale
PyTorch extensions for high performance and large scale training.
☆3,387 Updated 7 months ago
baidu-research / baidu-allreduce
☆600 Updated 7 years ago
pytorch / glow
Compiler for Neural Network hardware accelerators
☆3,321 Updated last year