facebookincubator / dynolog
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from system components such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
☆362Updated this week
Alternatives and similar repositories for dynolog
Users who are interested in dynolog are comparing it to the libraries listed below.
- CUDA checkpoint and restore utility☆406Updated 4 months ago
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes☆146Updated 10 months ago
- Meta's fleetwide profiler framework☆339Updated 4 months ago
- AI/GPU flame graph☆241Updated 3 months ago
- GPUd automates monitoring, diagnostics, and issue identification for GPUs☆472Updated last week
- NVIDIA Inference Xfer Library (NIXL)☆844Updated last week
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.☆122Updated 2 years ago
- KV cache store for distributed LLM inference☆389Updated 2 months ago
- cricket is a virtualization solution for GPUs☆235Updated 4 months ago
- DCPerf benchmark suite for hyperscale cloud applications☆229Updated this week
- A validation and profiling tool for AI infrastructure☆359Updated this week
- MSCCL++: A GPU-driven communication stack for scalable AI applications☆455Updated this week
- NVIDIA GPUDirect Storage Driver☆329Updated last month
- ☆71Updated 11 months ago
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs☆654Updated last month
- Hooks CUDA-related dynamic libraries using automated code generation tools.☆172Updated 2 years ago
- Splits a single NVIDIA GPU into multiple partitions with complete compute and memory isolation (with respect to performance) between the partitions☆165Updated 6 years ago
- RDMA and SHARP plugins for the NCCL library☆221Updated 2 weeks ago
- NVIDIA NCCL Tests for Distributed Training☆133Updated this week
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for…☆155Updated last week
- Open Model Engine (OME) — Kubernetes operator for LLM serving, GPU scheduling, and model lifecycle management. Works with SGLang, vLLM, T…☆365Updated this week
- This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications.☆203Updated last week
- Unified Collective Communication Library☆286Updated last week
- A tool for bandwidth measurements on NVIDIA GPUs.☆610Updated 9 months ago
- NCCL Profiling Kit☆150Updated last year
- A tool to detect infrastructure issues on cloud native AI systems☆52Updated 4 months ago
- NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves the …☆253Updated last week
- Awesome utilities for performance profiling☆199Updated 10 months ago
- A library to analyze PyTorch traces.☆464Updated last week
- Fast OS-level support for GPU checkpoint and restore☆270Updated 4 months ago