ppl-ai / libfabric-efa-demo

☆65

Alternatives and similar repositories for libfabric-efa-demo

Users who are interested in libfabric-efa-demo are comparing it to the libraries listed below.

NVIDIA / nvidia-resiliency-ext
NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the …
☆196 Updated last week
aws / aws-ofi-nccl
This is a plugin that lets EC2 developers use libfabric as the network provider while running NCCL applications.
☆182 Updated last week
uccl-project / uccl
Ultra and Unified CCL
☆459 Updated this week
NVIDIA / cuda-checkpoint
CUDA checkpoint and restore utility
☆357 Updated 6 months ago
ai-dynamo / nixl
NVIDIA Inference Xfer Library (NIXL)
☆502 Updated this week
microsoft / NPKit
NCCL Profiling Kit
☆140 Updated last year
facebookresearch / param
PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for…
☆149 Updated this week
google / nccl-fastsocket
NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.
☆119 Updated last year
ppl-ai / pplx-kernels
Perplexity GPU Kernels
☆418 Updated 3 weeks ago
abcdabcd987 / libfabric-efa-demo
☆48 Updated 7 months ago
Azure / msccl
Microsoft Collective Communication Library
☆65 Updated 8 months ago
microsoft / mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
☆394 Updated this week
Mellanox / nccl-rdma-sharp-plugins
RDMA and SHARP plugins for the NCCL library
☆200 Updated last month
microsoft / msccl
Microsoft Collective Communication Library
☆353 Updated last year
argonne-lcf / dlio_benchmark
An I/O benchmark for deep learning applications
☆89 Updated last month
bytedance / InfiniStore
KV cache store for distributed LLM inference
☆305 Updated 2 months ago
mcrl / tccl
Thunder Research Group's Collective Communication Library
☆39 Updated last month
facebookresearch / HolisticTraceAnalysis
A library to analyze PyTorch traces.
☆402 Updated this week
coreweave / nccl-tests
NVIDIA NCCL Tests for Distributed Training
☆102 Updated 2 weeks ago
suquark / hoplite
☆44 Updated 3 years ago
facebookincubator / dynolog
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the…
☆328 Updated last week
yifuwang / symm-mem-recipes
☆102 Updated 7 months ago
openucx / torch-ucc
PyTorch UCC plugin
☆23 Updated 4 years ago
imbue-ai / cluster-health
☆314 Updated 11 months ago
Azure / msccl-executor-nccl
☆44 Updated 7 months ago
merthidayetoglu / HiCCL
A hierarchical collective communications library with portable optimizations
☆36 Updated 8 months ago
microsoft / msccl-tools
Synthesizer for optimal collective communication algorithms
☆113 Updated last year
NVIDIA / cloudai
CloudAI Benchmark Framework
☆69 Updated last week
mk1-project / quickreduce
QuickReduce is a performant all-reduce library designed for AMD ROCm that supports inline compression.
☆31 Updated 4 months ago
GPUprobe / gpuprobe-daemon
Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes
☆121 Updated 4 months ago