GPUprobe / gpuprobe-daemon
Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes
☆146 Updated 10 months ago
Alternatives and similar repositories for gpuprobe-daemon
Users who are interested in gpuprobe-daemon are comparing it to the repositories listed below.
- CUDA checkpoint and restore utility ☆410 Updated 4 months ago
- A tool to detect infrastructure issues on cloud native AI systems ☆52 Updated 4 months ago
- Systematic and comprehensive benchmarks for LLM systems. ☆50 Updated last week
- Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the… ☆362 Updated last week
- cricket is a virtualization solution for GPUs ☆234 Updated 4 months ago
- DCPerf benchmark suite for hyperscale cloud applications ☆231 Updated this week
- Fast OS-level support for GPU checkpoint and restore ☆271 Updated 4 months ago
- AI/GPU flame graph ☆242 Updated 4 months ago
- Hooked CUDA-related dynamic libraries by using automated code generation tools. ☆172 Updated 2 years ago
- A tool for coordinated checkpoint/restore of distributed applications with CRIU ☆31 Updated 5 months ago
- ☆38 Updated 3 months ago
- An efficient GPU resource sharing system with fine-grained control for Linux platforms. ☆88 Updated last year
- An I/O benchmark for deep learning applications ☆102 Updated last month
- NVIDIA NCCL Tests for Distributed Training ☆134 Updated last week
- Offline optimization of your disaggregated Dynamo graph ☆177 Updated last week
- NVIDIA GPUDirect Storage Driver ☆331 Updated last month
- NCCL Profiling Kit ☆150 Updated last year
- ☆20 Updated 6 months ago
- Open Model Engine (OME) — Kubernetes operator for LLM serving, GPU scheduling, and model lifecycle management. Works with SGLang, vLLM, T… ☆365 Updated this week
- A lightweight vLLM simulator for mocking out replicas. ☆85 Updated this week
- ☆235 Updated last month
- KV cache store for distributed LLM inference ☆390 Updated 2 months ago
- GPU scheduler for elastic/distributed deep learning workloads in Kubernetes cluster (IC2E'23) ☆34 Updated 2 years ago
- A toolkit for discovering cluster network topology. ☆96 Updated last week
- [NSDI '24] DINT: Fast In-Kernel Distributed Transactions with eBPF ☆53 Updated last year
- Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport ☆73 Updated 8 months ago
- DOCA Platform manages provisioning and service orchestration for Bluefield DPUs ☆76 Updated this week
- ☆71 Updated 11 months ago
- qCUDA: GPGPU Virtualization at a New API Remoting Method with Para-virtualization ☆133 Updated 3 years ago
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆472 Updated last week