NVIDIA / Bobber
Containerized testing of system components that impact AI workload performance
☆14 · Updated last year
Alternatives and similar repositories for Bobber:
Users who are interested in Bobber are comparing it to the repositories listed below.
- Deep Learning Benchmarking Suite ☆130 · Updated 2 years ago
- Imageinary is a reproducible mechanism which is used to generate large image datasets at various resolutions. The tool supports multiple … ☆26 · Updated last year
- Python bindings for UCX ☆123 · Updated this week
- RAPIDS GPU-BDB ☆108 · Updated 10 months ago
- Enhanced networking support for TensorFlow. Maintained by SIG-networking. ☆98 · Updated 3 years ago
- Scoreboard for ONNX Backend Compatibility ☆27 · Updated this week
- Scheduling GPU cluster workloads with Slurm ☆74 · Updated 6 years ago
- This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications. ☆160 · Updated this week
- Tools to deploy GPU clusters in the Cloud ☆30 · Updated last year
- 3rd party dependencies for DALI project ☆10 · Updated 2 weeks ago
- Kubernetes Operator, ansible playbooks, and production scripts for large-scale AIStore deployments on Kubernetes. ☆88 · Updated last week
- oneAPI Collective Communications Library (oneCCL) ☆218 · Updated last week
- NGC Container Replicator ☆28 · Updated 2 years ago
- A top-like tool for monitoring GPUs in a cluster ☆84 · Updated 11 months ago
- NVIDIA's launch, startup, and logging scripts used by our MLPerf Training and HPC submissions ☆24 · Updated 3 weeks ago
- WIP. Veloce is a low-code Ray-based parallelization library that makes machine learning computation novel, efficient, and heterogeneous. ☆18 · Updated 2 years ago
- MLPerf™ logging library ☆32 · Updated 3 weeks ago
- Repository to go along with the paper "Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines" ☆9 · Updated 2 years ago
- NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud. ☆114 · Updated last year
- ☆24 · Updated this week
- MLCube® is a project that reduces friction for machine learning by ensuring that models are easily portable and reproducible. ☆155 · Updated 4 months ago
- A Ray-based data loader with per-epoch shuffling and configurable pipelining, for shuffling and loading training data for distributed tra… ☆18 · Updated 2 years ago
- oneCCL Bindings for Pytorch* ☆87 · Updated 3 weeks ago
- Singularity implementation of k8s operator for interacting with SLURM. ☆117 · Updated 4 years ago
- The Triton backend for the PyTorch TorchScript models. ☆141 · Updated last week
- AIBench, a tool for comparing and evaluating AI serving solutions. forked from [tsbs](https://github.com/timescale/tsbs) and adapted to A… ☆20 · Updated 4 months ago
- Testing if I can implement slurm in an operator ☆14 · Updated 2 months ago
- Reference implementations of MLPerf™ HPC training benchmarks ☆45 · Updated 8 months ago
- Distributed preprocessing and data loading for language datasets ☆39 · Updated 9 months ago
- Python bindings for NVTX ☆66 · Updated last year