project-codeflare / mlbatch
Queuing and quota management for AI/ML batch jobs on Kubernetes
☆14 Updated 6 months ago
Alternatives and similar repositories for mlbatch
Users who are interested in mlbatch are comparing it to the libraries listed below.
- Run Slurm as a Kubernetes scheduler. A Slinky project. ☆61 Updated last week
- Simplifying the definition and execution, scaling and deployment of pipelines on the cloud. ☆234 Updated 2 years ago
- Estimate resources needed to train LLMs ☆14 Updated last month
- InstaSlice facilitates the use of Dynamic Resource Allocation (DRA) on Kubernetes clusters for GPU sharing ☆30 Updated last year
- Python library for Synthetic Data Generation ☆52 Updated last month
- A top-like tool for monitoring GPUs in a cluster ☆84 Updated last year
- Run Slurm on Kubernetes. A Slinky project. ☆230 Updated this week
- Kubernetes enhancements for Network Topology Aware Gang Scheduling & Autoscaling ☆159 Updated last week
- Helm charts for llm-d ☆52 Updated 6 months ago
- Repository for open inference protocol specification ☆64 Updated 8 months ago
- InstructLab Training Library - Efficient Fine-Tuning with Message-Format Data ☆49 Updated this week
- JobSet: a k8s native API for distributed ML training and HPC workloads ☆304 Updated this week
- An Operator for deployment and maintenance of NVIDIA NIMs and NeMo microservices in a Kubernetes environment. ☆146 Updated this week
- ☆44 Updated this week
- IBM development fork of https://github.com/huggingface/text-generation-inference ☆63 Updated 4 months ago
- llm-d benchmark scripts and tooling ☆44 Updated this week
- 🚀 Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP. ☆56 Updated this week
- Carbon Limiting Auto Tuning for Kubernetes ☆37 Updated last year
- InstaSlice Operator facilitates slicing of accelerators using stable APIs ☆49 Updated this week
- Model Registry provides a single pane of glass for ML model developers to index and manage models, versions, and ML artifacts metadata. I… ☆161 Updated last week
- Run Slurm in Kubernetes ☆358 Updated this week
- Synthetic Data Generation Toolkit for LLMs ☆95 Updated last week
- This repository contains the results and code for the MLPerf™ Training v4.0 benchmark. ☆13 Updated last year
- Taxonomy tree that will allow you to create models tuned with your data ☆290 Updated 4 months ago
- Ray-based Apache Beam runner ☆42 Updated 2 years ago
- ☆51 Updated 6 months ago
- Kubernetes Operator, ansible playbooks, and production scripts for large-scale AIStore deployments on Kubernetes. ☆124 Updated this week
- A toolkit for discovering cluster network topology. ☆96 Updated this week
- ☆212 Updated this week
- Python library for Evaluation ☆16 Updated this week