GPUprobe / gpuprobe-daemon
Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes
☆78 Updated last month
Alternatives and similar repositories for gpuprobe-daemon:
Users who are interested in gpuprobe-daemon are comparing it to the libraries listed below
- NVIDIA Inference Xfer Library (NIXL) ☆191 Updated this week
- CUDA checkpoint and restore utility ☆310 Updated last month
- An I/O benchmark for deep learning applications ☆82 Updated this week
- An efficient GPU resource sharing system with fine-grained control for Linux platforms. ☆79 Updated last year
- Fast OS-level support for GPU checkpoint and restore ☆170 Updated 3 weeks ago
- Hooked CUDA-related dynamic libraries by using automated code generation tools. ☆150 Updated last year
- A tool to detect infrastructure issues on cloud native AI systems ☆28 Updated this week
- NCCL Profiling Kit ☆127 Updated 8 months ago
- KV cache store for distributed LLM inference ☆78 Updated this week
- ☆43 Updated 6 months ago
- DCPerf benchmark suite for hyperscale cloud applications ☆160 Updated this week
- GPU scheduler for elastic/distributed deep learning workloads in Kubernetes clusters (IC2E'23) ☆34 Updated last year
- cricket is a virtualization solution for GPUs ☆187 Updated last month
- DeepSeek-V3/R1 inference performance simulator ☆76 Updated this week
- Artifacts for our NSDI'23 paper TGS ☆76 Updated 9 months ago
- Intelligent platform for AI workloads ☆37 Updated 2 years ago
- InfiniStore: an elastic serverless cloud storage system (VLDB'23) ☆22 Updated last year
- The criu-coordinator tool aims to enable checkpoint/restore support for distributed applications with CRIU. ☆20 Updated 2 weeks ago
- NVIDIA NCCL Tests for Distributed Training ☆85 Updated last week
- ☆44 Updated 5 months ago
- Enabling Kubernetes to make pod placement decisions with platform intelligence. ☆174 Updated last month
- ☆30 Updated 4 months ago
- Microsoft Collective Communication Library ☆60 Updated 4 months ago
- 🧯 Kubernetes coverage for fault awareness and recovery, works for any LLMOps, MLOps, AI workloads. ☆28 Updated 3 months ago
- Resource Allocation for Dynamic Demands ☆20 Updated last year
- An interference-aware scheduler for fine-grained GPU sharing ☆129 Updated last month
- Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the… ☆302 Updated last week
- Cloud Native Benchmarking of Foundation Models ☆24 Updated 4 months ago
- [NSDI '24] DINT: Fast In-Kernel Distributed Transactions with eBPF ☆43 Updated 8 months ago
- Device plugin for Volcano vGPU that supports hard resource isolation ☆67 Updated last week