facebookincubator / dynolog
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system, such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
☆313 Updated last week
Alternatives and similar repositories for dynolog:
Users who are interested in dynolog are comparing it to the libraries listed below:
- CUDA checkpoint and restore utility ☆330 Updated 3 months ago
- Meta's fleetwide profiler framework ☆301 Updated 6 months ago
- DCPerf benchmark suite for hyperscale cloud applications ☆166 Updated this week
- NVIDIA Inference Xfer Library (NIXL) ☆304 Updated this week
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud. ☆116 Updated last year
- NCCL Profiling Kit ☆133 Updated 10 months ago
- A library to analyze PyTorch traces. ☆367 Updated last week
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes ☆92 Updated last month
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆345 Updated this week
- AI/GPU flame graph ☆122 Updated this week
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for… ☆137 Updated this week
- RDMA and SHARP plugins for nccl library ☆191 Updated 3 weeks ago
- KV cache store for distributed LLM inference ☆165 Updated this week
- NVIDIA NCCL Tests for Distributed Training ☆88 Updated 2 weeks ago
- Awesome utilities for performance profiling ☆171 Updated 2 months ago
- A tool for bandwidth measurements on NVIDIA GPUs. ☆413 Updated 3 weeks ago
- NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the … ☆151 Updated last week
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆351 Updated this week
- A tool for examining GPU scheduling behavior. ☆81 Updated 8 months ago
- oneAPI Collective Communications Library (oneCCL) ☆232 Updated last week
- ☆305 Updated 8 months ago
- ☆58 Updated 2 months ago
- cricket is a virtualization solution for GPUs ☆195 Updated 2 weeks ago
- Hooked CUDA-related dynamic libraries by using automated code generation tools. ☆153 Updated last year
- Unified Collective Communication Library ☆251 Updated last week
- Microsoft Collective Communication Library ☆344 Updated last year
- The core library and APIs implementing the Triton Inference Server. ☆125 Updated this week
- This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications. ☆169 Updated this week
- Perplexity GPU Kernels ☆272 Updated this week
- High-performance safetensors model loader ☆25 Updated last month