run-ai / genv

GPU environment and cluster management with LLM support

☆642

Alternatives and similar repositories for genv

Users who are interested in genv are comparing it to the libraries listed below.

nebuly-ai / nos
Module to Automatically maximize the utilization of GPU resources in a Kubernetes cluster through real-time dynamic partitioning and elas…
☆672 Updated last year
coreweave / tensorizer
Module, Model, and Tensor Serialization/Deserialization
☆267 Updated last month
run-ai / rntop
A top-like tool for monitoring GPUs in a cluster
☆85 Updated last year
run-ai / runai-model-streamer
☆255 Updated 2 weeks ago
meta-pytorch / torchx
TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and sup…
☆394 Updated last week
clearml / clearml-fractional-gpu
ClearML Fractional GPU - Run multiple containers on the same GPU with driver-level memory limitation ✨ and compute time-slicing
☆80 Updated last year
triton-inference-server / model_navigator
Triton Model Navigator is an inference toolkit designed for optimizing and deploying Deep Learning models with a focus on NVIDIA GPUs.
☆212 Updated 5 months ago
triton-inference-server / pytriton
PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.
☆823 Updated 2 months ago
lambdal / lambda-stack-dockerfiles
☆278 Updated 7 months ago
kserve / modelmesh
Distributed Model Serving Framework
☆177 Updated 2 weeks ago
grgalex / nvshare
Practical GPU Sharing Without Memory Size Constraints
☆287 Updated 6 months ago
NVIDIA / mig-parted
MIG Partition Editor for NVIDIA GPUs
☆217 Updated this week
NVIDIA / DCGM
NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
☆598 Updated last month
NVIDIA / KAI-Scheduler
KAI Scheduler is an open-source, Kubernetes-native scheduler for AI workloads at large scale
☆849 Updated last week
imbue-ai / cluster-health
☆315 Updated last year
kserve / modelmesh-serving
Controller for ModelMesh
☆237 Updated 4 months ago
triton-inference-server / model_analyzer
Triton Model Analyzer is a CLI tool to help with better understanding of the compute and memory requirements of the Triton Inference Serv…
☆494 Updated this week
NVIDIA / cuda-checkpoint
CUDA checkpoint and restore utility
☆373 Updated last month
run-ai / docs
markdown docs
☆94 Updated this week
lambdal / deeplearning-benchmark
Benchmark Suite for Deep Learning
☆276 Updated this week
wandb / server
W&B Server is the self-hosted version of Weights & Biases
☆328 Updated last week
leptonai / gpud
GPUd automates monitoring, diagnostics, and issue identification for GPUs
☆438 Updated this week
ray-project / ray-llm
RayLLM - LLMs on Ray (Archived). Read README for more info.
☆1,263 Updated 7 months ago
nebius / soperator
Run Slurm in Kubernetes
☆292 Updated this week
NVIDIA / pyxis
Container plugin for Slurm Workload Manager
☆386 Updated 2 weeks ago
meta-pytorch / torchft
Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)
☆415 Updated last week
aimhubio / aimlflow
aim-mlflow integration
☆221 Updated 2 years ago
huggingface / gpu-fryer
Where GPUs get cooked 👩‍🍳🔥
☆293 Updated 3 weeks ago
clearml / clearml-serving
ClearML - Model-Serving Orchestration and Repository Solution
☆157 Updated 2 weeks ago
clearml / clearml-agent
ClearML Agent - ML-Ops made easy. ML-Ops scheduler & orchestration solution
☆278 Updated 2 months ago