facebookincubator / dynolog
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from system components such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
☆362 Updated this week
Alternatives and similar repositories for dynolog
Users who are interested in dynolog are comparing it to the libraries listed below.
- A library to analyze PyTorch traces. ☆464 Feb 4, 2026 Updated 2 weeks ago
- A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters. ☆922 Updated this week
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for… ☆156 Updated this week
- NCCL Profiling Kit ☆152 Jul 1, 2024 Updated last year
- Meta's fleetwide profiler framework ☆342 Sep 22, 2025 Updated 4 months ago
- ☆13 Feb 6, 2026 Updated last week
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆475 Updated this week
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes ☆146 Mar 29, 2025 Updated 10 months ago
- Collection of scripts to build PyTorch and the domain libraries from source. ☆13 Feb 4, 2026 Updated last week
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ☆476 Feb 3, 2026 Updated 2 weeks ago
- NVIDIA Inference Xfer Library (NIXL) ☆885 Updated this week
- Libraries, guides, blueprints, and sample code, to enable rapidly building 0-1 applications on iOS, Android and web. ☆11 May 12, 2023 Updated 2 years ago
- CUDA checkpoint and restore utility ☆415 Sep 15, 2025 Updated 5 months ago
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆209 Sep 21, 2024 Updated last year
- An LLM-based system that fully automates Chaos Engineering (ASE 2025, NIER track) ☆23 Jan 16, 2026 Updated last month
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆664 Dec 4, 2025 Updated 2 months ago
- The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resou… ☆510 Feb 10, 2026 Updated last week
- [DEPRECATED] Moved to ROCm/rocm-systems repo ☆154 Jan 21, 2026 Updated 3 weeks ago
- A Datacenter Scale Distributed Inference Serving Framework ☆6,095 Updated this week
- CUDA Kernel Benchmarking Library ☆813 Updated this week
- C++ interfaces for RDMA access ☆83 Dec 22, 2025 Updated last month
- A tool for bandwidth measurements on NVIDIA GPUs. ☆623 Apr 15, 2025 Updated 10 months ago
- Microsoft Collective Communication Library ☆382 Sep 20, 2023 Updated 2 years ago
- Architecture-level Fault Injection Tool for GPU Application Resilience Evaluation ☆80 Oct 17, 2023 Updated 2 years ago
- Collective communications library with various primitives for multi-machine training. ☆1,399 Updated this week
- Repository linking to the software artifacts used for the MigrOS ATC 2021 paper ☆18 May 31, 2021 Updated 4 years ago
- A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology ☆1,343 Dec 17, 2025 Updated 2 months ago
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆461 May 30, 2025 Updated 8 months ago
- A validation and profiling tool for AI infrastructure ☆361 Updated this week
- A low-latency & high-throughput serving engine for LLMs ☆474 Jan 8, 2026 Updated last month
- A user level library for applications to transparently use Intel DSA. ☆42 Jan 23, 2026 Updated 3 weeks ago
- KV cache store for distributed LLM inference ☆392 Nov 13, 2025 Updated 3 months ago
- [IWQoS 2025] eACGM: An eBPF-based Automated Comprehensive Governance and Monitoring framework for AI/ML systems. ☆21 Aug 11, 2025 Updated 6 months ago
- TransferBench is a utility capable of benchmarking simultaneous copies between user-specified devices (CPUs/GPUs) ☆57 Updated this week
- Lightning In-Memory Object Store ☆47 Jan 22, 2022 Updated 4 years ago
- The repo for HotOS paper "FIFO can be Better than LRU: the Power of Lazy Promotion and Quick Demotion" ☆35 Jun 20, 2023 Updated 2 years ago
- Fast OS-level support for GPU checkpoint and restore ☆271 Sep 28, 2025 Updated 4 months ago
- ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale ☆518 Jan 3, 2026 Updated last month
- LLTFI is a tool, which is an extension of LLFI, allowing users to run fault injection experiments on C/C++, TensorFlow and PyTorch applic… ☆40 Oct 4, 2024 Updated last year