facebookincubator / dynolog
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from system components such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
☆353Updated this week
Alternatives and similar repositories for dynolog
Users who are interested in dynolog are comparing it to the libraries listed below.
- CUDA checkpoint and restore utility☆390Updated 2 months ago
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes☆140Updated 7 months ago
- Meta's fleetwide profiler framework☆328Updated 2 months ago
- AI/GPU flame graph☆190Updated last month
- DCPerf benchmark suite for hyperscale cloud applications☆214Updated last week
- NVIDIA Inference Xfer Library (NIXL)☆721Updated this week
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.☆122Updated 2 years ago
- GPUd automates monitoring, diagnostics, and issue identification for GPUs☆454Updated this week
- KV cache store for distributed LLM inference☆361Updated last week
- Awesome utilities for performance profiling☆196Updated 8 months ago
- ☆72Updated 9 months ago
- OME is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs)☆312Updated last week
- cricket is a virtualization solution for GPUs☆223Updated 2 months ago
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for…☆153Updated last week
- A validation and profiling tool for AI infrastructure☆348Updated last week
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs☆610Updated last month
- MSCCL++: A GPU-driven communication stack for scalable AI applications☆437Updated this week
- Unified Collective Communication Library☆279Updated last week
- NVIDIA GPUDirect Storage Driver☆297Updated 3 months ago
- NCCL Profiling Kit☆147Updated last year
- torchcomms: a modern PyTorch communications API☆291Updated this week
- NVIDIA NCCL Tests for Distributed Training☆124Updated last week
- Offline optimization of your disaggregated Dynamo graph☆106Updated this week
- NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the …☆234Updated last week
- System performance analysis and characterization tool☆413Updated this week
- RDMA and SHARP plugins for the NCCL library☆212Updated last month
- Hooked CUDA-related dynamic libraries by using automated code generation tools.☆169Updated last year
- ☆316Updated last year
- The core library and APIs implementing the Triton Inference Server.☆155Updated last week
- High-performance safetensors model loader☆72Updated last week