facebookincubator / dynologLinks
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the linux kernel, CPU, disks, Intel PT, GPUs etc. Dynolog also integrates with pytorch and can trigger traces for distributed training applications.
☆358Updated this week
Alternatives and similar repositories for dynolog
Users that are interested in dynolog are comparing it to the libraries listed below
Sorting:
- CUDA checkpoint and restore utility☆400Updated 3 months ago
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes☆144Updated 9 months ago
- AI/GPU flame graph☆234Updated 2 months ago
- Meta's fleetwide profiler framework☆338Updated 3 months ago
- ☆72Updated 10 months ago
- cricket is a virtualization solution for GPUs☆228Updated 3 months ago
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.☆122Updated 2 years ago
- NVIDIA Inference Xfer Library (NIXL)☆788Updated this week
- DCPerf benchmark suite for hyperscale cloud applications☆225Updated this week
- Open Model Engine (OME) — Kubernetes operator for LLM serving, GPU scheduling, and model lifecycle management. Works with SGLang, vLLM, T…☆352Updated this week
- KV cache store for distributed LLM inference☆378Updated last month
- NVIDIA NCCL Tests for Distributed Training☆131Updated this week
- A validation and profiling tool for AI infrastructure☆352Updated this week
- GPUd automates monitoring, diagnostics, and issue identification for GPUs☆468Updated this week
- NVIDIA GPUDirect Storage Driver☆312Updated 2 weeks ago
- A tool to detect infrastructure issues on cloud native AI systems☆52Updated 3 months ago
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs☆638Updated last month
- Awesome utilities for performance profiling☆198Updated 10 months ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications☆446Updated last week
- NCCL Profiling Kit☆149Updated last year
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for…☆154Updated 3 weeks ago
- RDMA and SHARP plugins for nccl library☆218Updated last month
- Hooked CUDA-related dynamic libraries by using automated code generation tools.☆172Updated 2 years ago
- A tool for bandwidth measurements on NVIDIA GPUs.☆598Updated 8 months ago
- A library to analyze PyTorch traces.☆453Updated 3 weeks ago
- Unified Collective Communication Library☆286Updated 2 weeks ago
- torchcomms: a modern PyTorch communications API☆315Updated this week
- Offline optimization of your disaggregated Dynamo graph☆136Updated last week
- Splits single Nvidia GPU into multiple partitions with complete compute and memory isolation (wrt to performace) between the partitions☆164Updated 6 years ago
- Perplexity open source garden for inference technology☆321Updated last week