facebookincubator / dynolog
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from system components such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
☆310 Updated this week
Alternatives and similar repositories for dynolog:
Users who are interested in dynolog are comparing it to the libraries listed below.
- CUDA checkpoint and restore utility ☆322 Updated 2 months ago
- NVIDIA Inference Xfer Library (NIXL) ☆255 Updated this week
- Meta's fleetwide profiler framework ☆297 Updated 5 months ago
- DCPerf benchmark suite for hyperscale cloud applications ☆162 Updated last week
- KV cache store for distributed LLM inference ☆136 Updated 2 weeks ago
- NCCL Profiling Kit ☆129 Updated 9 months ago
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud. ☆116 Updated last year
- ☆58 Updated 2 months ago
- cricket is a virtualization solution for GPUs ☆191 Updated last month
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆331 Updated this week
- A library to analyze PyTorch traces. ☆366 Updated this week
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for… ☆134 Updated this week
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆339 Updated this week
- A tool for bandwidth measurements on NVIDIA GPUs. ☆401 Updated 2 months ago
- NVIDIA NCCL Tests for Distributed Training ☆88 Updated last week
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes ☆84 Updated 2 weeks ago
- NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves the … ☆139 Updated last week
- RDMA and SHARP plugins for the NCCL library ☆188 Updated last week
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆492 Updated last month
- Hooked CUDA-related dynamic libraries by using automated code generation tools. ☆150 Updated last year
- ☆301 Updated 7 months ago
- System performance analysis and characterization tool ☆372 Updated this week
- Perplexity GPU Kernels ☆204 Updated last week
- HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of… ☆142 Updated 2 weeks ago
- Unified Collective Communication Library ☆246 Updated this week
- This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications. ☆168 Updated this week
- The local version of the backend and UI for the gProfiler agent, featuring advanced flamegraph analysis tools. For the also free cloud ve… ☆178 Updated this week
- A GPU-driven system framework for scalable AI applications ☆114 Updated 2 months ago
- RAPIDS Memory Manager ☆569 Updated this week
- ☆48 Updated last month