Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from system components such as the Linux kernel, CPUs, disks, Intel PT, and GPUs. Dynolog also integrates with PyTorch and can trigger traces for distributed training applications.
☆375 · Jun 24, 2026 · Updated this week
Alternatives and similar repositories for dynolog
Users that are interested in dynolog are comparing it to the libraries listed below.
- A library to analyze PyTorch traces. ☆531 · May 29, 2026 · Updated last month
- A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters. ☆967 · Updated this week
- PArametrized Recommendation and Ai Model benchmark is a repository for development of numerous uBenchmarks as well as end to end nets for… ☆155 · Jun 9, 2026 · Updated 2 weeks ago
- Meta's fleetwide profiler framework ☆349 · Jun 12, 2026 · Updated 2 weeks ago
- NCCL Profiling Kit ☆155 · Jul 1, 2024 · Updated last year
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆484 · Updated this week
- Collection of scripts to build PyTorch and the domain libraries from source. ☆14 · Jun 9, 2026 · Updated 2 weeks ago
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆222 · Sep 21, 2024 · Updated last year
- CUPTI based GPU profiling library exposing usdt hooks ☆33 · Updated this week
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ☆514 · Jun 9, 2026 · Updated 2 weeks ago
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆746 · Jun 11, 2026 · Updated 2 weeks ago
- NVIDIA Inference Xfer Library (NIXL) ☆1,106 · Updated this week
- Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes ☆151 · Mar 29, 2025 · Updated last year
- CUDA checkpoint and restore utility ☆467 · Sep 15, 2025 · Updated 9 months ago
- ☆18 · May 16, 2022 · Updated 4 years ago
- CUDA Kernel Benchmarking Library ☆878 · Jun 22, 2026 · Updated last week
- A Datacenter Scale Distributed Inference Serving Framework ☆7,352 · Updated this week
- An LLM-based system that fully automates Chaos Engineering (ASE 2025, NIER track) ☆29 · Apr 6, 2026 · Updated 2 months ago
- A tool for bandwidth measurements on NVIDIA GPUs. ☆720 · Apr 8, 2026 · Updated 2 months ago
- Microsoft Collective Communication Library ☆391 · Sep 20, 2023 · Updated 2 years ago
- Collective communications library with various primitives for multi-machine training. ☆1,437 · Jun 17, 2026 · Updated last week
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆533 · Updated this week
- [DEPRECATED] Moved to ROCm/rocm-systems repo ☆154 · May 28, 2026 · Updated last month
- The NVIDIA® Tools Extension SDK (NVTX) is a C-based Application Programming Interface (API) for annotating events, code ranges, and resou… ☆540 · Jun 19, 2026 · Updated last week
- Byted PyTorch Distributed for Hyperscale Training of LLMs and RLs ☆1,027 · Mar 3, 2026 · Updated 3 months ago
- A low-latency & high-throughput serving engine for LLMs ☆507 · Jan 8, 2026 · Updated 5 months ago
- ☆13 · Feb 6, 2026 · Updated 4 months ago
- TransferBench is a utility capable of benchmarking simultaneous copies between user-specified devices (CPUs/GPUs) ☆71 · Jun 22, 2026 · Updated last week
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels ☆212 · Jun 10, 2026 · Updated 2 weeks ago
- LLTFI is a tool, which is an extension of LLFI, allowing users to run fault injection experiments on C/C++, TensorFlow and PyTorch applic… ☆44 · Jun 18, 2026 · Updated last week
- [IWQoS 2025] eACGM: An eBPF-based Automated Comprehensive Governance and Monitoring framework for AI/ML systems. ☆23 · Aug 11, 2025 · Updated 10 months ago
- Fast OS-level support for GPU checkpoint and restore ☆285 · Sep 28, 2025 · Updated 9 months ago
- Optimized primitives for collective multi-GPU communication ☆4,832 · Updated this week
- Scripts for managing a large H100 cluster and fixing hardware issues to ensure smooth model training. ☆325 · Aug 20, 2024 · Updated last year
- Architecture-level Fault Injection Tool for GPU Application Resilience Evaluation ☆82 · Oct 17, 2023 · Updated 2 years ago
- Library containing safer alternatives/wrappers for insecure C APIs. ☆24 · Apr 28, 2026 · Updated 2 months ago
- A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology ☆1,392 · Jun 15, 2026 · Updated 2 weeks ago
- Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. ☆5,672 · Updated this week
- FlashInfer: Kernel Library for LLM Serving ☆5,867 · Updated this week