GPUprobe / gpuprobe-daemon
Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes
☆122 Updated 4 months ago
Alternatives and similar repositories for gpuprobe-daemon
Users who are interested in gpuprobe-daemon are comparing it to the repositories listed below:
- CUDA checkpoint and restore utility ☆360 Updated 6 months ago
- AI/GPU flame graph ☆182 Updated 3 weeks ago
- Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the… ☆330 Updated last week
- Fast OS-level support for GPU checkpoint and restore ☆228 Updated last week
- NVIDIA GPUDirect Storage Driver ☆277 Updated last week
- cricket is a virtualization solution for GPUs ☆213 Updated 2 months ago
- DCPerf benchmark suite for hyperscale cloud applications ☆203 Updated this week
- Hooked CUDA-related dynamic libraries by using automated code generation tools. ☆165 Updated last year
- Systematic and comprehensive benchmarks for LLM systems. ☆26 Updated 2 weeks ago
- A tool to detect infrastructure issues on cloud native AI systems ☆45 Updated 3 weeks ago
- An efficient GPU resource sharing system with fine-grained control for Linux platforms. ☆84 Updated last year
- ☆33 Updated last week
- ☆18 Updated last month
- A tool for coordinated checkpoint/restore of distributed applications with CRIU ☆27 Updated this week
- KV cache store for distributed LLM inference ☆311 Updated 2 months ago
- OME is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs) ☆226 Updated this week
- NVIDIA NCCL Tests for Distributed Training ☆105 Updated this week
- qCUDA: GPGPU Virtualization at a New API Remoting Method with Para-virtualization ☆126 Updated 3 years ago
- ☆66 Updated 6 months ago
- An I/O benchmark for deep learning applications ☆90 Updated 2 months ago
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆413 Updated this week
- NVIDIA Inference Xfer Library (NIXL) ☆536 Updated this week
- A lightweight vLLM simulator, for mocking out replicas. ☆32 Updated this week
- XRP: In-Kernel Storage Functions with eBPF ☆232 Updated 2 years ago
- An OS kernel module for fast **remote** fork using advanced datacenter networking (RDMA). ☆64 Updated 6 months ago
- GPU scheduler for elastic/distributed deep learning workloads in Kubernetes cluster (IC2E'23) ☆35 Updated last year
- Practical GPU Sharing Without Memory Size Constraints ☆280 Updated 4 months ago
- ☆43 Updated last year
- Example code for using DC QP to provide RDMA READ and WRITE operations to remote GPU memory ☆140 Updated last year
- [NSDI '24] DINT: Fast In-Kernel Distributed Transactions with eBPF ☆45 Updated last year