GPUprobe / gpuprobe-daemon
Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes
☆143 · Updated 8 months ago
Alternatives and similar repositories for gpuprobe-daemon
Users interested in gpuprobe-daemon are comparing it to the repositories listed below.
- CUDA checkpoint and restore utility · ☆397 · Updated 3 months ago
- cricket is a virtualization solution for GPUs · ☆227 · Updated 3 months ago
- Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the… · ☆356 · Updated last week
- Hooked CUDA-related dynamic libraries by using automated code generation tools. · ☆172 · Updated 2 years ago
- Fast OS-level support for GPU checkpoint and restore · ☆261 · Updated 2 months ago
- A tool to detect infrastructure issues on cloud native AI systems · ☆52 · Updated 3 months ago
- A tool for coordinated checkpoint/restore of distributed applications with CRIU · ☆30 · Updated 3 months ago
- An efficient GPU resource sharing system with fine-grained control for Linux platforms. · ☆87 · Updated last year
- Systematic and comprehensive benchmarks for LLM systems. · ☆44 · Updated 3 weeks ago
- AI/GPU flame graph · ☆232 · Updated 2 months ago
- ☆21 · Updated 5 months ago
- NVIDIA NCCL Tests for Distributed Training · ☆129 · Updated this week
- Offline optimization of your disaggregated Dynamo graph · ☆128 · Updated this week
- DCPerf benchmark suite for hyperscale cloud applications · ☆223 · Updated last week
- ☆214 · Updated 4 months ago
- NVIDIA GPUDirect Storage Driver · ☆310 · Updated this week
- qCUDA: GPGPU Virtualization at a New API Remoting Method with Para-virtualization · ☆131 · Updated 3 years ago
- OME is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs) · ☆341 · Updated this week
- This repository is an archive. Refer to https://github.com/gvirtus/GVirtuS · ☆44 · Updated 3 years ago
- ☆37 · Updated 2 months ago
- NCCL Profiling Kit · ☆149 · Updated last year
- Artifacts for our NSDI'23 paper TGS · ☆93 · Updated last year
- GPUd automates monitoring, diagnostics, and issue identification for GPUs · ☆464 · Updated last week
- GPU scheduler for elastic/distributed deep learning workloads in Kubernetes cluster (IC2E'23) · ☆34 · Updated 2 years ago
- An I/O benchmark for deep learning applications · ☆95 · Updated last week
- A lightweight vLLM simulator, for mocking out replicas. · ☆65 · Updated this week
- A workload for deploying LLM inference services on Kubernetes · ☆140 · Updated this week
- An OS kernel module for fast **remote** fork using advanced datacenter networking (RDMA). · ☆69 · Updated 10 months ago
- Example code for using DC QP for providing RDMA READ and WRITE operations to remote GPU memory · ☆150 · Updated last year
- KV cache store for distributed LLM inference · ☆376 · Updated last month