silogen / cluster-forge
A Kubernetes operator that sets up all platform tools so that a cluster is ready to run applications.
☆14Updated this week
Alternatives and similar repositories for cluster-forge
Users who are interested in cluster-forge are comparing it to the repositories listed below.
- AI Workload Orchestrator for Kubernetes☆12Updated this week
- TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and sup…☆373Updated last week
- ☆21Updated this week
- Triton Model Analyzer is a CLI tool to help with better understanding of the compute and memory requirements of the Triton Inference Serv…☆480Updated last month
- MLCube® is a project that reduces friction for machine learning by ensuring that models are easily portable and reproducible.☆157Updated 10 months ago
- Container plugin for Slurm Workload Manager☆356Updated last week
- NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the …☆193Updated this week
- This repository contains the results and code for the MLPerf™ Training v3.1 benchmark.☆17Updated 6 months ago
- Provide Python access to the NVML library for GPU diagnostics☆242Updated 7 months ago
- Triton Model Navigator is an inference toolkit designed for optimizing and deploying Deep Learning models with a focus on NVIDIA GPUs.☆208Updated 3 months ago
- Unified specification for defining and executing ML workflows, making reproducibility, consistency, and governance easier across the ML p…☆93Updated last year
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs☆549Updated 2 months ago
- ☆52Updated 4 months ago
- Tools to deploy GPU clusters in the Cloud☆31Updated 2 years ago
- DGXC Benchmarking provides recipes in ready-to-use templates for evaluating performance of specific AI use cases across hardware and soft…☆18Updated 3 weeks ago
- NVIDIA NCCL Tests for Distributed Training☆97Updated this week
- ☆58Updated 10 months ago
- Examples for using ONNX Runtime for model training.☆338Updated 9 months ago
- ☆14Updated 2 months ago
- A library to analyze PyTorch traces.☆397Updated last week
- A tool for bandwidth measurements on NVIDIA GPUs.☆487Updated 3 months ago
- An open-source efficient deep learning framework/compiler, written in python.☆710Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs☆77Updated this week
- The Triton backend for the ONNX Runtime.☆156Updated this week
- GraphDef Editor: A port of the TensorFlow contrib.graph_editor package that operates over serialized graphs☆31Updated 2 years ago
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O…☆307Updated last month
- Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)☆190Updated this week
- A validation and profiling tool for AI infrastructure☆323Updated last week
- ☆411Updated last year
- Common source, scripts and utilities for creating Triton backends.☆334Updated this week