facebookincubator / dynolog
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different system components, such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
☆313 Updated last week
Alternatives and similar repositories for dynolog
Users who are interested in dynolog are comparing it to the libraries listed below.
- Meta's fleetwide profiler framework ☆305 Updated 2 weeks ago
- CUDA checkpoint and restore utility ☆339 Updated 4 months ago
- DCPerf benchmark suite for hyperscale cloud applications ☆178 Updated this week
- NVIDIA Inference Xfer Library (NIXL) ☆352 Updated this week
- AI/GPU flame graph ☆146 Updated 3 weeks ago
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud. ☆116 Updated last year
- KV cache store for distributed LLM inference ☆250 Updated this week
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆360 Updated this week
- NCCL Profiling Kit ☆134 Updated 11 months ago
- RDMA and SHARP plugins for the NCCL library ☆193 Updated last month
- NVIDIA GPUDirect Storage Driver ☆246 Updated last month
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆362 Updated this week
- cricket is a virtualization solution for GPUs ☆197 Updated last month
- Unified Collective Communication Library ☆253 Updated last week
- Fine-grained GPU sharing primitives ☆141 Updated 5 years ago
- A library to analyze PyTorch traces. ☆379 Updated this week
- A plugin that lets EC2 developers use libfabric as a network provider while running NCCL applications. ☆172 Updated last week
- Microsoft Collective Communication Library ☆346 Updated last year
- ☆61 Updated 3 months ago
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for… ☆140 Updated this week
- NVIDIA NCCL Tests for Distributed Training ☆91 Updated last week
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes ☆96 Updated 2 months ago
- Magnum IO community repo ☆95 Updated 2 weeks ago
- The local version of the backend and UI for the gProfiler agent, featuring advanced flamegraph analysis tools. For the also free cloud ve… ☆181 Updated this week
- An efficient GPU resource sharing system with fine-grained control for Linux platforms. ☆83 Updated last year
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆515 Updated 3 weeks ago
- A validation and profiling tool for AI infrastructure ☆312 Updated this week
- Hooked CUDA-related dynamic libraries by using automated code generation tools. ☆156 Updated last year
- oneAPI Collective Communications Library (oneCCL) ☆234 Updated last week
- NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves the … ☆169 Updated this week