facebookincubator / dynolog
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system, such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
☆296Updated this week
Alternatives and similar repositories for dynolog:
Users who are interested in dynolog are comparing it to the libraries listed below:
- CUDA checkpoint and restore utility☆289Updated 3 weeks ago
- Meta's fleetwide profiler framework☆104Updated 3 months ago
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.☆115Updated last year
- NCCL Profiling Kit☆127Updated 7 months ago
- DCPerf benchmark suite for hyperscale cloud applications☆157Updated this week
- MSCCL++: A GPU-driven communication stack for scalable AI applications☆297Updated this week
- cricket is a virtualization solution for GPUs☆181Updated this week
- Hooked CUDA-related dynamic libraries by using automated code generation tools.☆145Updated last year
- Unified Collective Communication Library☆227Updated this week
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs☆460Updated this week
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes☆50Updated this week
- A library to analyze PyTorch traces.☆332Updated last week
- RDMA and SHARP plugins for the NCCL library☆176Updated 3 weeks ago
- A tool for bandwidth measurements on NVIDIA GPUs.☆364Updated last week
- Microsoft Collective Communication Library☆333Updated last year
- oneAPI Collective Communications Library (oneCCL)☆222Updated 3 weeks ago
- A validation and profiling tool for AI infrastructure☆292Updated this week
- NVIDIA GPUDirect Storage Driver☆222Updated 2 months ago
- Awesome utilities for performance profiling☆159Updated last year
- A distributed KV store for disaggregated LLM inference☆31Updated this week
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for…☆128Updated this week
- NVIDIA NCCL Tests for Distributed Training☆79Updated 3 weeks ago
- This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications.☆163Updated this week
- A GPU-driven system framework for scalable AI applications☆112Updated 2 weeks ago
- MLPerf™ Storage Benchmark Suite☆117Updated 6 months ago
- GPUd automates monitoring, diagnostics, and issue identification for GPUs☆276Updated this week
- Splits a single NVIDIA GPU into multiple partitions with complete compute and memory isolation (with respect to performance) between the partitions☆157Updated 5 years ago
- An I/O benchmark for deep learning applications☆76Updated this week
- Senpai is an automated memory sizing tool for container applications.☆316Updated last year
- HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of…☆138Updated this week