facebookincubator / dynologLinks
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the linux kernel, CPU, disks, Intel PT, GPUs etc. Dynolog also integrates with pytorch and can trigger traces for distributed training applications.
☆349Updated this week
Alternatives and similar repositories for dynolog
Users that are interested in dynolog are comparing it to the libraries listed below
Sorting:
- CUDA checkpoint and restore utility☆379Updated last month
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes☆134Updated 7 months ago
- AI/GPU flame graph☆189Updated 3 weeks ago
- Meta's fleetwide profiler framework☆327Updated last month
- NVIDIA Inference Xfer Library (NIXL)☆688Updated this week
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.☆121Updated last year
- cricket is a virtualization solution for GPUs☆217Updated last month
- GPUd automates monitoring, diagnostics, and issue identification for GPUs☆441Updated last week
- ☆72Updated 8 months ago
- KV cache store for distributed LLM inference☆346Updated last month
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for…☆153Updated 2 weeks ago
- DCPerf benchmark suite for hyperscale cloud applications☆212Updated last week
- OME is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs)☆298Updated last week
- Unified Collective Communication Library☆278Updated this week
- Awesome utilities for performance profiling☆196Updated 7 months ago
- A tool to detect infrastructure issues on cloud native AI systems☆49Updated last month
- NVIDIA GPUDirect Storage Driver☆294Updated 2 months ago
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs☆601Updated 2 weeks ago
- A validation and profiling tool for AI infrastructure☆342Updated last week
- A tool for bandwidth measurements on NVIDIA GPUs.☆553Updated 6 months ago
- RDMA and SHARP plugins for nccl library☆211Updated last week
- NVIDIA NCCL Tests for Distributed Training☆118Updated this week
- MSCCL++: A GPU-driven communication stack for scalable AI applications☆427Updated this week
- NCCL Profiling Kit☆145Updated last year
- NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the …☆228Updated last week
- A library to analyze PyTorch traces.☆419Updated 2 weeks ago
- Splits single Nvidia GPU into multiple partitions with complete compute and memory isolation (wrt to performace) between the partitions☆160Updated 6 years ago
- Offline optimization of your disaggregated Dynamo graph☆88Updated this week
- ☆316Updated last year
- Hooked CUDA-related dynamic libraries by using automated code generation tools.☆167Updated last year