coreweave / nccl-testsLinks

NVIDIA NCCL Tests for Distributed Training

☆100

Alternatives and similar repositories for nccl-tests

Users that are interested in nccl-tests are comparing it to the libraries listed below

Sorting:

NVIDIA / mig-parted
MIG Partition Editor for NVIDIA GPUs
☆204Updated last week
NVIDIA / DCGM
NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
☆551Updated 2 months ago
NVIDIA / cuda-checkpoint
CUDA checkpoint and restore utility
☆353Updated 6 months ago
Mellanox / nccl-rdma-sharp-plugins
RDMA and SHARP plugins for nccl library
☆199Updated last month
NTHU-LSALAB / Gemini
An efficient GPU resource sharing system with fine-grained control for Linux platforms.
☆84Updated last year
google / nccl-fastsocket
NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.
☆118Updated last year
ai-dynamo / nixl
NVIDIA Inference Xfer Library (NIXL)
☆491Updated this week
Bruce-Lee-LY / cuda_hook
Hooked CUDA-related dynamic libraries by using automated code generation tools.
☆160Updated last year
IBM / autopilot
A tool to detect infrastructure issues on cloud native AI systems
☆44Updated last week
microsoft / NPKit
NCCL Profiling Kit
☆139Updated last year
triton-inference-server / triton_distributed
☆52Updated 4 months ago
Mellanox / ofed-docker
☆43Updated last year
leptonai / gpud
GPUd automates monitoring, diagnostics, and issue identification for GPUs
☆401Updated this week
aws / aws-ofi-nccl
This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications.
☆181Updated last week
sgl-project / ome
OME is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs)
☆192Updated last week
RWTH-ACS / cricket
cricket is a virtualization solution for GPUs
☆211Updated last month
imbue-ai / cluster-health
☆313Updated 11 months ago
NVIDIA / nvbandwidth
A tool for bandwidth measurements on NVIDIA GPUs.
☆492Updated 3 months ago
microsoft / superbenchmark
A validation and profiling tool for AI infrastructure
☆325Updated last week
Mellanox / nv_peer_memory
☆361Updated last year
NVIDIA / nvidia-resiliency-ext
NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the …
☆194Updated last week
Mellanox / k8s-rdma-shared-dev-plugin
☆283Updated last week
alibaba / GPU-scheduler-for-deep-learning
GPU-scheduler-for-deep-learning
☆210Updated 4 years ago
kubedl-io / morphling
Automatic tuning for ML model deployment on Kubernetes
☆80Updated 9 months ago
AlibabaPAI / llumnix
Efficient and easy multi-instance LLM serving
☆454Updated this week
Azure / msccl
Microsoft Collective Communication Library
☆63Updated 8 months ago
pokerfaceSad / GPUMounter
A kubernetes plugin which enables dynamically add or remove GPU resources for a running Pod
☆127Updated 3 years ago
AliyunContainerService / et-operator
Kubernetes Operator for AI and Bigdata Elastic Training
☆87Updated 6 months ago
Azure / azurehpc-health-checks
Health checks for Azure N- and H-series VMs.
☆48Updated this week
sakjain92 / Fractional-GPUs
Splits single Nvidia GPU into multiple partitions with complete compute and memory isolation (wrt to performace) between the partitions
☆159Updated 6 years ago