facebookincubator / dynolog
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system, such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
☆320 Updated last week
Alternatives and similar repositories for dynolog
Users who are interested in dynolog are comparing it to the libraries listed below.
- CUDA checkpoint and restore utility ☆345 Updated 5 months ago
- AI/GPU flame graph ☆168 Updated last month
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes ☆113 Updated 3 months ago
- Meta's fleetwide profiler framework ☆315 Updated 2 months ago
- DCPerf benchmark suite for hyperscale cloud applications ☆191 Updated this week
- NVIDIA Inference Xfer Library (NIXL) ☆459 Updated this week
- cricket is a virtualization solution for GPUs ☆205 Updated last month
- KV cache store for distributed LLM inference ☆288 Updated last month
- Ultra and Unified CCL ☆390 Updated this week
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud. ☆117 Updated last year
- NVIDIA GPUDirect Storage Driver ☆258 Updated 2 months ago
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆385 Updated this week
- NCCL Profiling Kit ☆139 Updated last year
- Unified Collective Communication Library ☆259 Updated last week
- NVIDIA NCCL Tests for Distributed Training ☆97 Updated 2 weeks ago
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆542 Updated 2 months ago
- A validation and profiling tool for AI infrastructure ☆320 Updated last week
- OME is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs) ☆156 Updated this week
- Hooked CUDA-related dynamic libraries by using automated code generation tools. ☆158 Updated last year
- ☆62 Updated 5 months ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆385 Updated this week
- A library to analyze PyTorch traces. ☆391 Updated this week
- Awesome utilities for performance profiling ☆183 Updated 4 months ago
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for… ☆147 Updated last week
- RDMA and SHARP plugins for the NCCL library ☆197 Updated 3 weeks ago
- This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications. ☆177 Updated this week
- Splits a single NVIDIA GPU into multiple partitions with complete compute and memory isolation (with respect to performance) between the partitions ☆159 Updated 6 years ago
- System performance analysis and characterization tool ☆388 Updated this week
- An efficient GPU resource sharing system with fine-grained control for Linux platforms. ☆82 Updated last year
- A GPU-driven system framework for scalable AI applications ☆116 Updated 5 months ago