Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from various system components, such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
☆369 · Apr 4, 2026 · Updated this week
Alternatives and similar repositories for dynolog
Users that are interested in dynolog are comparing it to the libraries listed below.
- A library to analyze PyTorch traces. ☆495 · Apr 1, 2026 · Updated last week
- A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters. ☆942 · Updated this week
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for… ☆153 · Apr 1, 2026 · Updated last week
- Meta's fleetwide profiler framework ☆346 · Updated this week
- NCCL Profiling Kit ☆152 · Jul 1, 2024 · Updated last year
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆479 · Updated this week
- CUPTI based GPU profiling library exposing usdt hooks ☆28 · Apr 2, 2026 · Updated last week
- Collection of scripts to build PyTorch and the domain libraries from source. ☆14 · Apr 1, 2026 · Updated last week
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆213 · Sep 21, 2024 · Updated last year
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ☆491 · Updated this week
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆701 · Mar 30, 2026 · Updated last week
- NVIDIA Inference Xfer Library (NIXL) ☆963 · Apr 2, 2026 · Updated last week
- An LLM-based system that fully automates Chaos Engineering (ASE 2025, NIER track) ☆26 · Apr 1, 2026 · Updated last week
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes ☆148 · Mar 29, 2025 · Updated last year
- CUDA checkpoint and restore utility ☆434 · Sep 15, 2025 · Updated 6 months ago
- ☆18 · May 16, 2022 · Updated 3 years ago
- A Datacenter Scale Distributed Inference Serving Framework ☆6,470 · Apr 3, 2026 · Updated last week
- CUDA Kernel Benchmarking Library ☆847 · Updated this week
- A tool for bandwidth measurements on NVIDIA GPUs. ☆655 · Apr 15, 2025 · Updated 11 months ago
- ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale ☆553 · Mar 25, 2026 · Updated 2 weeks ago
- Microsoft Collective Communication Library ☆389 · Sep 20, 2023 · Updated 2 years ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆496 · Apr 2, 2026 · Updated last week
- Collective communications library with various primitives for multi-machine training. ☆1,407 · Mar 20, 2026 · Updated 3 weeks ago
- [DEPRECATED] Moved to ROCm/rocm-systems repo ☆153 · Mar 31, 2026 · Updated last week
- The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resou… ☆523 · Updated this week
- Byted PyTorch Distributed for Hyperscale Training of LLMs and RLs ☆1,004 · Mar 3, 2026 · Updated last month
- A low-latency & high-throughput serving engine for LLMs ☆490 · Jan 8, 2026 · Updated 3 months ago
- ☆13 · Feb 6, 2026 · Updated 2 months ago
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels ☆198 · Updated this week
- TransferBench is a utility capable of benchmarking simultaneous copies between user-specified devices (CPUs/GPUs) ☆63 · Updated this week
- LLTFI is a tool, which is an extension of LLFI, allowing users to run fault injection experiments on C/C++, TensorFlow and PyTorch applic… ☆40 · Mar 19, 2026 · Updated 3 weeks ago
- Fast OS-level support for GPU checkpoint and restore ☆280 · Sep 28, 2025 · Updated 6 months ago
- [IWQoS 2025] eACGM: An eBPF-based Automated Comprehensive Governance and Monitoring framework for AI/ML systems. ☆22 · Aug 11, 2025 · Updated 7 months ago
- Optimized primitives for collective multi-GPU communication ☆4,600 · Updated this week
- Scripts for managing a large H100 cluster and fixing hardware issues to ensure smooth model training. ☆323 · Aug 20, 2024 · Updated last year
- Architecture-level Fault Injection Tool for GPU Application Resilience Evaluation ☆81 · Oct 17, 2023 · Updated 2 years ago
- A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology ☆1,360 · Mar 12, 2026 · Updated 3 weeks ago
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆470 · May 30, 2025 · Updated 10 months ago
- Benchmark tests supporting the TiledCUDA library. ☆18 · Nov 19, 2024 · Updated last year