Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from system components such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
☆366 · Mar 12, 2026 · Updated last week
Alternatives and similar repositories for dynolog
Users who are interested in dynolog are comparing it to the libraries listed below.
- A library to analyze PyTorch traces. ☆474 · Updated this week
- A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters. ☆932 · Updated this week
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for… ☆152 · Mar 9, 2026 · Updated last week
- Meta's fleetwide profiler framework ☆345 · Sep 22, 2025 · Updated 5 months ago
- NCCL Profiling Kit ☆152 · Jul 1, 2024 · Updated last year
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆479 · Updated this week
- CUPTI based GPU profiling library exposing usdt hooks ☆26 · Mar 5, 2026 · Updated 2 weeks ago
- Collection of scripts to build PyTorch and the domain libraries from source. ☆14 · Feb 4, 2026 · Updated last month
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆210 · Sep 21, 2024 · Updated last year
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ☆487 · Updated this week
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆685 · Feb 17, 2026 · Updated last month
- NVIDIA Inference Xfer Library (NIXL) ☆929 · Mar 13, 2026 · Updated last week
- An LLM-based system that fully automates Chaos Engineering (ASE 2025, NIER track) ☆25 · Jan 16, 2026 · Updated 2 months ago
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes ☆147 · Mar 29, 2025 · Updated 11 months ago
- CUDA checkpoint and restore utility ☆429 · Sep 15, 2025 · Updated 6 months ago
- ☆18 · May 16, 2022 · Updated 3 years ago
- A Datacenter Scale Distributed Inference Serving Framework ☆6,250 · Updated this week
- CUDA Kernel Benchmarking Library ☆831 · Updated this week
- A tool for bandwidth measurements on NVIDIA GPUs. ☆643 · Apr 15, 2025 · Updated 11 months ago
- ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale ☆533 · Mar 12, 2026 · Updated last week
- Microsoft Collective Communication Library ☆387 · Sep 20, 2023 · Updated 2 years ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆481 · Updated this week
- Collective communications library with various primitives for multi-machine training. ☆1,405 · Mar 11, 2026 · Updated last week
- [DEPRECATED] Moved to ROCm/rocm-systems repo ☆153 · Jan 21, 2026 · Updated last month
- The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resou… ☆518 · Updated this week
- A low-latency & high-throughput serving engine for LLMs ☆484 · Jan 8, 2026 · Updated 2 months ago
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels ☆196 · Updated this week
- Byted PyTorch Distributed for Hyperscale Training of LLMs and RLs ☆1,000 · Mar 3, 2026 · Updated 2 weeks ago
- TransferBench is a utility capable of benchmarking simultaneous copies between user-specified devices (CPUs/GPUs) ☆60 · Updated this week
- LLTFI is a tool, which is an extension of LLFI, allowing users to run fault injection experiments on C/C++, TensorFlow and PyTorch applic… ☆41 · Updated this week
- Fast OS-level support for GPU checkpoint and restore ☆277 · Sep 28, 2025 · Updated 5 months ago
- C# port of Google's SwissTable hash map ☆17 · Sep 24, 2022 · Updated 3 years ago
- [IWQoS 2025] eACGM: An eBPF-based Automated Comprehensive Governance and Monitoring framework for AI/ML systems. ☆21 · Aug 11, 2025 · Updated 7 months ago
- Optimized primitives for collective multi-GPU communication ☆4,531 · Updated this week
- Architecture-level Fault Injection Tool for GPU Application Resilience Evaluation ☆80 · Oct 17, 2023 · Updated 2 years ago
- Scripts for managing a large H100 cluster and fixing hardware issues to ensure smooth model training. ☆323 · Aug 20, 2024 · Updated last year
- A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology ☆1,355 · Mar 12, 2026 · Updated last week
- Library containing safer alternatives/wrappers for insecure C APIs. ☆24 · Apr 2, 2025 · Updated 11 months ago
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆466 · May 30, 2025 · Updated 9 months ago