facebookincubator / dynologLinks
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the linux kernel, CPU, disks, Intel PT, GPUs etc. Dynolog also integrates with pytorch and can trigger traces for distributed training applications.
☆330Updated last week
Alternatives and similar repositories for dynolog
Users that are interested in dynolog are comparing it to the libraries listed below
Sorting:
- CUDA checkpoint and restore utility☆360Updated 6 months ago
- Meta's fleetwide profiler framework☆317Updated 3 months ago
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes☆124Updated 4 months ago
- AI/GPU flame graph☆182Updated 3 weeks ago
- NVIDIA Inference Xfer Library (NIXL)☆557Updated this week
- DCPerf benchmark suite for hyperscale cloud applications☆203Updated this week
- ☆66Updated 6 months ago
- KV cache store for distributed LLM inference☆311Updated 2 months ago
- cricket is a virtualization solution for GPUs☆213Updated 2 months ago
- OME is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs)☆226Updated this week
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.☆120Updated last year
- Awesome utilities for performance profiling☆186Updated 5 months ago
- Ultra and Unified CCL☆497Updated this week
- GPUd automates monitoring, diagnostics, and issue identification for GPUs☆413Updated this week
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs☆561Updated this week
- NVIDIA NCCL Tests for Distributed Training☆105Updated this week
- Splits single Nvidia GPU into multiple partitions with complete compute and memory isolation (wrt to performace) between the partitions☆159Updated 6 years ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications☆399Updated this week
- A validation and profiling tool for AI infrastructure☆328Updated this week
- Hooked CUDA-related dynamic libraries by using automated code generation tools.☆165Updated last year
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for…☆149Updated this week
- A tool for bandwidth measurements on NVIDIA GPUs.☆515Updated 4 months ago
- Unified Collective Communication Library☆263Updated last week
- NVIDIA GPUDirect Storage Driver☆277Updated last week
- A tool to detect infrastructure issues on cloud native AI systems☆45Updated 3 weeks ago
- A library to analyze PyTorch traces.☆404Updated last week
- ☆314Updated last year
- An efficient GPU resource sharing system with fine-grained control for Linux platforms.☆84Updated last year
- oneAPI Collective Communications Library (oneCCL)☆241Updated 2 weeks ago
- An I/O benchmark for deep Learning applications☆90Updated 2 months ago