silogen / cluster-forge
A Kubernetes operator that sets up all platform tools so that a cluster is ready to run applications.
☆17 Updated this week
Alternatives and similar repositories for cluster-forge
Users who are interested in cluster-forge are comparing it to the repositories listed below.
- AI Workload Orchestrator for Kubernetes ☆18 Updated last week
- Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU) ☆205 Updated last week
- MLCube® is a project that reduces friction for machine learning by ensuring that models are easily portable and reproducible. ☆158 Updated 2 months ago
- MAD (Model Automation and Dashboarding) ☆31 Updated this week
- oneCCL Bindings for Pytorch* (deprecated) ☆104 Updated last month
- Container plugin for Slurm Workload Manager ☆412 Updated last month
- The goal of the OSSCI Fleet is to provide a central mechanism to enable test automation, batch job scheduling, and developer access to a … ☆13 Updated 2 weeks ago
- Reference implementations of MLPerf® inference benchmarks ☆1,525 Updated this week
- ☆72 Updated this week
- NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the … ☆262 Updated this week
- TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and sup… ☆412 Updated this week
- Reference models for Intel(R) Gaudi(R) AI Accelerator ☆170 Updated last month
- Triton Model Navigator is an inference toolkit designed for optimizing and deploying Deep Learning models with a focus on NVIDIA GPUs. ☆217 Updated this week
- This repository contains the results and code for the MLPerf™ Training v3.1 benchmark. ☆17 Updated last year
- Module, Model, and Tensor Serialization/Deserialization ☆286 Updated 5 months ago
- ☆61 Updated last year
- [DEPRECATED] Moved to ROCm/rocm-systems repo ☆410 Updated this week
- Provide Python access to the NVML library for GPU diagnostics ☆258 Updated 5 months ago
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ☆475 Updated this week
- Triton Model Analyzer is a CLI tool to help with better understanding of the compute and memory requirements of the Triton Inference Serv… ☆504 Updated this week
- [DEPRECATED] Moved to ROCm/rocm-systems repo ☆29 Updated last week
- A tool to configure, launch and manage your machine learning experiments. ☆216 Updated this week
- ☆28 Updated last week
- ☆280 Updated this week
- A validation and profiling tool for AI infrastructure ☆360 Updated this week
- Issues related to MLPerf™ training policies, including rules and suggested changes ☆95 Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆85 Updated this week
- Tensors and Dynamic neural networks in Python with strong GPU acceleration ☆247 Updated last week
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU(XPU) device. Note… ☆64 Updated 7 months ago
- ONNX Script enables developers to naturally author ONNX functions and models using a subset of Python. ☆420 Updated this week