facebookincubator / dynologLinks
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the linux kernel, CPU, disks, Intel PT, GPUs etc. Dynolog also integrates with pytorch and can trigger traces for distributed training applications.
☆338Updated this week
Alternatives and similar repositories for dynolog
Users that are interested in dynolog are comparing it to the libraries listed below
Sorting:
- CUDA checkpoint and restore utility☆367Updated 7 months ago
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes☆128Updated 5 months ago
- Meta's fleetwide profiler framework☆317Updated 4 months ago
- NVIDIA Inference Xfer Library (NIXL)☆603Updated this week
- AI/GPU flame graph☆184Updated last month
- ☆69Updated 7 months ago
- DCPerf benchmark suite for hyperscale cloud applications☆203Updated this week
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs☆572Updated 3 weeks ago
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.☆121Updated last year
- KV cache store for distributed LLM inference☆326Updated 3 months ago
- Awesome utilities for performance profiling☆188Updated 6 months ago
- NVIDIA GPUDirect Storage Driver☆281Updated last month
- cricket is a virtualization solution for GPUs☆215Updated this week
- OME is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs)☆261Updated last week
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for…☆149Updated last week
- NVIDIA NCCL Tests for Distributed Training☆110Updated last week
- GPUd automates monitoring, diagnostics, and issue identification for GPUs☆425Updated this week
- ☆315Updated last year
- MSCCL++: A GPU-driven communication stack for scalable AI applications☆414Updated this week
- Ultra and Unified CCL☆530Updated this week
- Unified Collective Communication Library☆275Updated this week
- A validation and profiling tool for AI infrastructure☆331Updated this week
- RDMA and SHARP plugins for nccl library☆203Updated this week
- A library to analyze PyTorch traces.☆406Updated 3 weeks ago
- NCCL Profiling Kit☆143Updated last year
- Splits single Nvidia GPU into multiple partitions with complete compute and memory isolation (wrt to performace) between the partitions☆160Updated 6 years ago
- A tool for bandwidth measurements on NVIDIA GPUs.☆527Updated 5 months ago
- A tool to detect infrastructure issues on cloud native AI systems☆47Updated last month
- Hooked CUDA-related dynamic libraries by using automated code generation tools.☆165Updated last year
- NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the …☆216Updated this week