NVIDIA / groveLinks

Kubernetes enhancements for Network Topology Aware Gang Scheduling & Autoscaling

☆67

Alternatives and similar repositories for grove

Users that are interested in grove are comparing it to the libraries listed below

Sorting:

NVIDIA / topograph
A toolkit for discovering cluster network topology.
☆72Updated this week
llm-d / llm-d-inference-scheduler
Inference scheduler for llm-d
☆99Updated this week
NVIDIA / KAI-Scheduler
KAI Scheduler is an open source Kubernetes Native scheduler for AI workloads at large scale
☆857Updated this week
kubernetes-sigs / gateway-api-inference-extension
Gateway API Inference Extension
☆495Updated this week
sgl-project / ome
OME is a Kubernetes operator for enterprise-grade management and serving of Large Language Models (LLMs)
☆292Updated this week
NVIDIA / k8s-nim-operator
An Operator for deployment and maintenance of NVIDIA NIMs and NeMo microservices in a Kubernetes environment.
☆131Updated this week
NVIDIA / k8s-dra-driver-gpu
NVIDIA DRA Driver for GPUs
☆456Updated last week
kubernetes-sigs / lws
LeaderWorkerSet: An API for deploying a group of pods as a unit of replication
☆597Updated this week
kubernetes-sigs / inference-perf
GenAI inference performance benchmarking tool
☆105Updated this week
llm-d / llm-d-inference-sim
A light weight vLLM simulator, for mocking out replicas.
☆52Updated 2 weeks ago
run-ai / fake-gpu-operator
☆151Updated last week
InftyAI / llmaz
☸️ Easy, advanced inference platform for large language models on Kubernetes. 🌟 Star to support our work!
☆260Updated this week
kubernetes-sigs / jobset
JobSet: a k8s native API for distributed ML training and HPC workloads
☆266Updated this week
NVIDIA / knavigator
knavigator is a development, testing, and optimization toolkit for AI/ML scheduling systems at scale on Kubernetes.
☆70Updated 3 months ago
NVIDIA / mig-parted
MIG Partition Editor for NVIDIA GPUs
☆217Updated last week
llm-d / llm-d-model-service
Simplified model deployment on llm-d
☆27Updated 3 months ago
llm-d / llm-d-kv-cache-manager
Distributed KV cache coordinator
☆78Updated last week
NVIDIA / cuda-checkpoint
CUDA checkpoint and restore utility
☆376Updated last month
IBM / autopilot
A tool to detect infrastructure issues on cloud native AI systems
☆48Updated last month
ai-dynamo / nixl
NVIDIA Inference Xfer Library (NIXL)
☆663Updated this week
kubernetes-sigs / wg-serving
WG Serving
☆30Updated this week
project-codeflare / multi-cluster-app-dispatcher
Holistic job manager on Kubernetes
☆116Updated last year
coreweave / nccl-tests
NVIDIA NCCL Tests for Distributed Training
☆114Updated this week
leptonai / gpud
GPUd automates monitoring, diagnostics, and issue identification for GPUs
☆438Updated last week
llm-d / llm-d-benchmark
llm-d benchmark scripts and tooling
☆30Updated this week
Mellanox / network-operator
NVIDIA Network Operator
☆284Updated this week
kubernetes-sigs / dra-example-driver
Example DRA driver that developers can fork and modify to get them started writing their own.
☆94Updated last month
llm-d / llm-d
Achieve state of the art inference performance with modern accelerators on Kubernetes
☆1,876Updated this week
llm-d / llm-d-deployer
Helm charts for llm-d
☆50Updated 2 months ago
sgl-project / rbg
A workload for deploying LLM inference services on Kubernetes
☆77Updated this week