facebookincubator / dynologLinks
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the linux kernel, CPU, disks, Intel PT, GPUs etc. Dynolog also integrates with pytorch and can trigger traces for distributed training applications.
☆345Updated last week
Alternatives and similar repositories for dynolog
Users that are interested in dynolog are comparing it to the libraries listed below
Sorting:
- CUDA checkpoint and restore utility☆373Updated 3 weeks ago
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes☆130Updated 6 months ago
- Meta's fleetwide profiler framework☆320Updated 2 weeks ago
- ☆71Updated 7 months ago
- NVIDIA Inference Xfer Library (NIXL)☆654Updated this week
- AI/GPU flame graph☆186Updated 2 weeks ago
- DCPerf benchmark suite for hyperscale cloud applications☆207Updated this week
- KV cache store for distributed LLM inference☆338Updated 3 weeks ago
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.☆120Updated last year
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for…☆151Updated this week
- cricket is a virtualization solution for GPUs☆218Updated 3 weeks ago
- GPUd automates monitoring, diagnostics, and issue identification for GPUs☆437Updated this week
- Unified Collective Communication Library☆277Updated last week
- NVIDIA NCCL Tests for Distributed Training☆112Updated last week
- A validation and profiling tool for AI infrastructure☆338Updated this week
- Ultra and Unified CCL☆578Updated this week
- OME is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs)☆286Updated this week
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs☆594Updated last month
- MSCCL++: A GPU-driven communication stack for scalable AI applications☆417Updated this week
- NVIDIA GPUDirect Storage Driver☆285Updated last month
- Awesome utilities for performance profiling☆193Updated 7 months ago
- A tool to detect infrastructure issues on cloud native AI systems☆47Updated 2 weeks ago
- A tool for bandwidth measurements on NVIDIA GPUs.☆541Updated 5 months ago
- NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the …☆226Updated this week
- RDMA and SHARP plugins for nccl library