Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from system components such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
☆373 · Jun 2, 2026 · Updated last week
Alternatives and similar repositories for dynolog
Users who are interested in dynolog are comparing it to the libraries listed below.
- A library to analyze PyTorch traces. ☆524 · May 29, 2026 · Updated last week
- A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters. ☆958 · Updated this week
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for… ☆155 · May 6, 2026 · Updated last month
- Meta's fleetwide profiler framework ☆348 · Jun 2, 2026 · Updated last week
- NCCL Profiling Kit ☆154 · Jul 1, 2024 · Updated last year
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆482 · Updated this week
- Collection of scripts to build PyTorch and the domain libraries from source. ☆14 · Apr 1, 2026 · Updated 2 months ago
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆220 · Sep 21, 2024 · Updated last year
- CUPTI based GPU profiling library exposing usdt hooks ☆32 · May 29, 2026 · Updated last week
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ☆506 · Updated this week
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆737 · May 29, 2026 · Updated last week
- NVIDIA Inference Xfer Library (NIXL) ☆1,072 · Updated this week
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes ☆150 · Mar 29, 2025 · Updated last year
- CUDA checkpoint and restore utility ☆459 · Sep 15, 2025 · Updated 8 months ago
- ☆18 · May 16, 2022 · Updated 4 years ago
- CUDA Kernel Benchmarking Library ☆870 · Updated this week
- A Datacenter Scale Distributed Inference Serving Framework ☆7,200 · Updated this week
- An LLM-based system that fully automates Chaos Engineering (ASE 2025, NIER track) ☆29 · Apr 6, 2026 · Updated 2 months ago
- A tool for bandwidth measurements on NVIDIA GPUs. ☆709 · Apr 8, 2026 · Updated 2 months ago
- ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale ☆602 · Apr 25, 2026 · Updated last month
- Microsoft Collective Communication Library ☆390 · Sep 20, 2023 · Updated 2 years ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆530 · Updated this week
- Collective communications library with various primitives for multi-machine training. ☆1,429 · Apr 21, 2026 · Updated last month
- [DEPRECATED] Moved to ROCm/rocm-systems repo ☆154 · May 28, 2026 · Updated last week
- The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resou… ☆537 · May 28, 2026 · Updated last week
- Byted PyTorch Distributed for Hyperscale Training of LLMs and RLs ☆1,021 · Mar 3, 2026 · Updated 3 months ago
- A low-latency & high-throughput serving engine for LLMs ☆506 · Jan 8, 2026 · Updated 5 months ago
- ☆13 · Feb 6, 2026 · Updated 4 months ago
- TransferBench is a utility capable of benchmarking simultaneous copies between user-specified devices (CPUs/GPUs) ☆69 · Jun 2, 2026 · Updated last week
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels ☆210 · Updated this week
- LLTFI is a tool, which is an extension of LLFI, allowing users to run fault injection experiments on C/C++, TensorFlow and PyTorch applic… ☆44 · Updated this week
- C# port of Google's SwissTable hash map ☆17 · Sep 24, 2022 · Updated 3 years ago
- [IWQoS 2025] eACGM: An eBPF-based Automated Comprehensive Governance and Monitoring framework for AI/ML systems. ☆23 · Aug 11, 2025 · Updated 9 months ago
- Fast OS-level support for GPU checkpoint and restore ☆283 · Sep 28, 2025 · Updated 8 months ago
- Optimized primitives for collective multi-GPU communication ☆4,785 · Updated this week
- Scripts for managing a large H100 cluster and fixing hardware issues to ensure smooth model training. ☆326 · Aug 20, 2024 · Updated last year
- Architecture-level Fault Injection Tool for GPU Application Resilience Evaluation ☆82 · Oct 17, 2023 · Updated 2 years ago
- Library containing safer alternatives/wrappers for insecure C APIs. ☆24 · Apr 28, 2026 · Updated last month
- A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology ☆1,380 · Mar 12, 2026 · Updated 2 months ago