facebookincubator / dynolog
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system, such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
☆302 Updated this week
Alternatives and similar repositories for dynolog:
Users who are interested in dynolog are comparing it to the libraries listed below.
- CUDA checkpoint and restore utility ☆310 Updated last month
- DCPerf benchmark suite for hyperscale cloud applications ☆161 Updated this week
- Meta's fleetwide profiler framework ☆275 Updated 4 months ago
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud. ☆116 Updated last year
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆311 Updated this week
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes ☆76 Updated last month
- NCCL Profiling Kit ☆127 Updated 8 months ago
- A validation and profiling tool for AI infrastructure ☆302 Updated this week
- RDMA and SHARP plugins for nccl library ☆183 Updated 2 months ago
- cricket is a virtualization solution for GPUs ☆187 Updated last month
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆480 Updated last month
- A distributed KV store for disaggregated LLM inference ☆62 Updated this week
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for… ☆132 Updated this week
- Hooked CUDA-related dynamic libraries by using automated code generation tools. ☆149 Updated last year
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆286 Updated this week
- The local version of the backend and UI for the gProfiler agent, featuring advanced flamegraph analysis tools. For the also free cloud ve… ☆176 Updated this week
- Awesome utilities for performance profiling ☆167 Updated 2 weeks ago
- This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications. ☆167 Updated this week
- Unified Collective Communication Library ☆237 Updated this week
- A library to analyze PyTorch traces. ☆348 Updated last week
- NVIDIA NCCL Tests for Distributed Training ☆85 Updated last week
- NVIDIA GPUDirect Storage Driver ☆231 Updated 3 months ago
- Microsoft Collective Communication Library ☆343 Updated last year
- An I/O benchmark for deep Learning applications ☆82 Updated 2 weeks ago
- oneAPI Collective Communications Library (oneCCL) ☆225 Updated 2 weeks ago
- A tool for bandwidth measurements on NVIDIA GPUs. ☆391 Updated last month
- System performance analysis and characterization tool ☆370 Updated this week
- A GPU-driven system framework for scalable AI applications ☆113 Updated last month
- ☆296 Updated 7 months ago