facebookincubator / dynologLinks
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the linux kernel, CPU, disks, Intel PT, GPUs etc. Dynolog also integrates with pytorch and can trigger traces for distributed training applications.
☆326Updated this week
Alternatives and similar repositories for dynolog
Users that are interested in dynolog are comparing it to the libraries listed below
Sorting:
- CUDA checkpoint and restore utility☆353Updated 6 months ago
- Meta's fleetwide profiler framework☆316Updated 2 months ago
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes☆117Updated 4 months ago
- AI/GPU flame graph☆178Updated last week
- DCPerf benchmark suite for hyperscale cloud applications☆195Updated last week
- NVIDIA Inference Xfer Library (NIXL)☆491Updated this week
- ☆65Updated 5 months ago
- cricket is a virtualization solution for GPUs☆211Updated last month
- OME is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs)☆192Updated last week
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.☆118Updated last year
- GPUd automates monitoring, diagnostics, and issue identification for GPUs☆401Updated this week
- KV cache store for distributed LLM inference☆303Updated last month
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs☆551Updated 2 months ago
- NVIDIA GPUDirect Storage Driver☆272Updated 3 months ago
- Hooked CUDA-related dynamic libraries by using automated code generation tools.☆160Updated last year
- NVIDIA NCCL Tests for Distributed Training☆100Updated last week
- Awesome utilities for performance profiling☆185Updated 4 months ago
- Ultra and Unified CCL☆440Updated this week
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for…☆147Updated last week
- A validation and profiling tool for AI infrastructure☆325Updated last week
- A tool to detect infrastructure issues on cloud native AI systems☆44Updated last week
- An I/O benchmark for deep Learning applications☆89Updated last month
- Unified Collective Communication Library☆262Updated this week
- NCCL Profiling Kit☆139Updated last year
- RDMA and SHARP plugins for nccl library☆199Updated last month
- A tool for bandwidth measurements on NVIDIA GPUs.☆492Updated 3 months ago
- System performance analysis and characterization tool☆390Updated this week
- MSCCL++: A GPU-driven communication stack for scalable AI applications☆390Updated this week
- An efficient GPU resource sharing system with fine-grained control for Linux platforms.☆84Updated last year
- Splits single Nvidia GPU into multiple partitions with complete compute and memory isolation (wrt to performace) between the partitions☆159Updated 6 years ago