GoogleCloudPlatform / slurm-gcp

☆51

Alternatives and similar repositories for slurm-gcp

Users who are interested in slurm-gcp are comparing it to the repositories listed below.

GoogleCloudPlatform / cluster-toolkit
Cluster Toolkit is open-source software offered by Google Cloud that makes it easy for customers to deploy AI/ML and HPC environments…
☆277 Updated this week
AI-Hypercomputer / xpk
xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerat…
☆136 Updated this week
google / saxml
☆142 Updated last week
pytorch / test-infra
This repository hosts code that supports the testing infrastructure for the PyTorch organization. For example, this repo hosts the logic …
☆96 Updated this week
NVIDIA / ais-k8s
Kubernetes Operator, Ansible playbooks, and production scripts for large-scale AIStore deployments on Kubernetes.
☆107 Updated this week
coreweave / ml-containers
☆38 Updated this week
AI-Hypercomputer / gpu-recipes
Recipes for reproducing training and serving benchmarks for large machine learning models using GPUs on Google Cloud.
☆78 Updated this week
NVIDIA / pyxis
Container plugin for Slurm Workload Manager
☆369 Updated this week
GoogleCloudPlatform / ml-testing-accelerators
Testing framework for deep learning models (TensorFlow and PyTorch) on Google Cloud hardware accelerators (TPU and GPU)
☆64 Updated last month
SlinkyProject / slurm-bridge
Run Slurm as a Kubernetes scheduler. A Slinky project.
☆31 Updated 2 weeks ago
AI-Hypercomputer / JetStream
JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs wel…
☆365 Updated last month
mila-iqia / milabench
Repository of machine learning benchmarks
☆39 Updated this week
GoogleCloudPlatform / ml-auto-solutions
A simplified and automated orchestration workflow to perform ML end-to-end (E2E) model tests and benchmarking on Cloud VMs across differe…
☆50 Updated this week
IBM / text-generation-inference
IBM development fork of https://github.com/huggingface/text-generation-inference
☆61 Updated 3 months ago
Snowflake-Labs / vllm
☆15 Updated 4 months ago
AI-Hypercomputer / tpu-recipes
☆39 Updated 3 weeks ago
NVIDIA / mlperf-common
NVIDIA's launch, startup, and logging scripts used by our MLPerf Training and HPC submissions
☆29 Updated 2 weeks ago
run-ai / rntop
A top-like tool for monitoring GPUs in a cluster
☆85 Updated last year
jax-ml / ml_dtypes
A stand-alone implementation of several NumPy dtype extensions used in machine learning.
☆284 Updated last week
foundation-model-stack / fms-acceleration
🚀 Collection of libraries used with fms-hf-tuning to accelerate fine-tuning and training of large models.
☆11 Updated last month
mlcommons / mlcube
MLCube® is a project that reduces friction for machine learning by ensuring that models are easily portable and reproducible.
☆157 Updated 10 months ago
google / drjax
☆12 Updated last week
jax-ml / jax-tpu-embedding
☆19 Updated this week
gpu-mode / discord-cluster-manager
Write a fast kernel and run it on Discord. See how you compare against the best!
☆48 Updated this week
pytorch / torchx
TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and sup…
☆381 Updated this week
SchedMD / slurm-gcp
Slurm on Google Cloud Platform
☆189 Updated 10 months ago
mlcommons / policies
General policies for MLPerf™ including submission rules, coding standards, etc.
☆30 Updated last week
open-lm-engine / lm-engine
LM engine is a library for pretraining/finetuning LLMs
☆61 Updated this week
pytorch / torchft
Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)
☆377 Updated this week
google / jaxonnxruntime
A user-friendly tool chain that enables the seamless execution of ONNX models using JAX as the backend.
☆118 Updated last week