Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system, such as the Linux kernel, CPU, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
☆370 · Apr 24, 2026 · Updated this week
Alternatives and similar repositories for dynolog
Users that are interested in dynolog are comparing it to the libraries listed below.
- A library to analyze PyTorch traces. ☆510 · Apr 22, 2026 · Updated last week
- A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters. ☆948 · Updated this week
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for… ☆154 · Apr 22, 2026 · Updated last week
- Meta's fleetwide profiler framework ☆346 · Apr 6, 2026 · Updated 3 weeks ago
- NCCL Profiling Kit ☆153 · Jul 1, 2024 · Updated last year
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆479 · Updated this week
- CUPTI based GPU profiling library exposing usdt hooks ☆29 · Apr 15, 2026 · Updated 2 weeks ago
- Collection of scripts to build PyTorch and the domain libraries from source. ☆14 · Apr 1, 2026 · Updated 3 weeks ago
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆214 · Sep 21, 2024 · Updated last year
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ☆499 · Apr 3, 2026 · Updated 3 weeks ago
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆715 · Apr 21, 2026 · Updated last week
- NVIDIA Inference Xfer Library (NIXL) ☆1,003 · Updated this week
- An LLM-based system that fully automates Chaos Engineering (ASE 2025, NIER track) ☆26 · Apr 6, 2026 · Updated 3 weeks ago
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes ☆148 · Mar 29, 2025 · Updated last year
- CUDA checkpoint and restore utility ☆443 · Sep 15, 2025 · Updated 7 months ago
- ☆18 · May 16, 2022 · Updated 3 years ago
- CUDA Kernel Benchmarking Library ☆858 · Apr 22, 2026 · Updated last week
- A Datacenter Scale Distributed Inference Serving Framework ☆6,634 · Updated this week
- A tool for bandwidth measurements on NVIDIA GPUs. ☆681 · Apr 8, 2026 · Updated 3 weeks ago
- ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale ☆569 · Updated this week
- Microsoft Collective Communication Library ☆389 · Sep 20, 2023 · Updated 2 years ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆505 · Updated this week
- Collective communications library with various primitives for multi-machine training. ☆1,419 · Apr 21, 2026 · Updated last week
- [DEPRECATED] Moved to ROCm/rocm-systems repo ☆153 · Apr 14, 2026 · Updated 2 weeks ago
- The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resou… ☆529 · Updated this week
- Byted PyTorch Distributed for Hyperscale Training of LLMs and RLs ☆1,009 · Mar 3, 2026 · Updated last month
- A low-latency & high-throughput serving engine for LLMs ☆496 · Jan 8, 2026 · Updated 3 months ago
- ☆13 · Feb 6, 2026 · Updated 2 months ago
- TransferBench is a utility capable of benchmarking simultaneous copies between user-specified devices (CPUs/GPUs) ☆64 · Updated this week
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels ☆203 · Updated this week
- LLTFI is a tool, which is an extension of LLFI, allowing users to run fault injection experiments on C/C++, TensorFlow and PyTorch applic… ☆40 · Updated this week
- [IWQoS 2025] eACGM: An eBPF-based Automated Comprehensive Governance and Monitoring framework for AI/ML systems. ☆22 · Aug 11, 2025 · Updated 8 months ago
- Fast OS-level support for GPU checkpoint and restore ☆281 · Sep 28, 2025 · Updated 7 months ago
- Optimized primitives for collective multi-GPU communication ☆4,656 · Updated this week
- Scripts for managing a large H100 cluster and fixing hardware issues to ensure smooth model training. ☆324 · Aug 20, 2024 · Updated last year
- Architecture-level Fault Injection Tool for GPU Application Resilience Evaluation ☆81 · Oct 17, 2023 · Updated 2 years ago
- A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology ☆1,370 · Mar 12, 2026 · Updated last month
- Library containing safer alternatives/wrappers for insecure C APIs. ☆24 · Apr 6, 2026 · Updated 3 weeks ago
- FlashInfer: Kernel Library for LLM Serving ☆5,498 · Updated this week