Trainy-ai / konduktor
Cluster/scheduler health monitoring for GPU jobs on k8s
☆47 · Updated this week
Alternatives and similar repositories for konduktor:
Users interested in konduktor are comparing it to the libraries listed below.
- WebAssembly dev environment for Envoy Proxy. Iterate on your HTTP/TCP middleware in seconds! ☆54 · Updated last year
- Fine-tuning and serving LLMs on any cloud ☆87 · Updated last year
- Profiling tools for distributed training ☆38 · Updated last year
- Orchestrated process and container checkpointing ☆73 · Updated this week
- Visualize your GPU usage ☆16 · Updated last year
- Cedana: Access and run on compute anywhere in the world, on any provider. Migrate seamlessly between providers, arbitraging price/perform… ☆58 · Updated 9 months ago
- A simple DAG for executing LLM calls and using tools. ☆39 · Updated last year
- CUDA checkpoint and restore utility ☆274 · Updated this week
- Module, Model, and Tensor Serialization/Deserialization ☆210 · Updated 2 months ago
- Pretrain, finetune and serve LLMs on Intel platforms with Ray ☆110 · Updated 2 months ago
- Runner in charge of collecting metrics from LLM inference endpoints for the Unify Hub ☆17 · Updated 11 months ago
- Kubernetes Operator, ansible playbooks, and production scripts for large-scale AIStore deployments on Kubernetes. ☆88 · Updated last week
- ☆154 · Updated last week
- Cloud Native Benchmarking of Foundation Models ☆21 · Updated 2 months ago
- A landscape of the infrastructure that powers the generative AI ecosystem ☆135 · Updated 3 months ago
- Augment Swarm with durable execution to help you build reliable and scalable multi-agent systems. ☆88 · Updated 2 months ago
- ⚡️ A fast and flexible PyTorch inference server that runs locally, on any cloud or AI HW. ☆134 · Updated 7 months ago
- A simple service that integrates vLLM with Ray Serve for fast and scalable LLM serving. ☆60 · Updated 9 months ago
- A simple pure Python/PyTorch performance daemon for training workloads ☆15 · Updated last year
- ☆30 · Updated 2 years ago
- Pixeltable: AI data infrastructure providing a declarative, incremental approach for multimodal workloads. ☆144 · Updated this week
- Serverless LLM Serving for Everyone. ☆407 · Updated this week
- JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs wel… ☆268 · Updated this week
- ☆43 · Updated 7 months ago
- Making Long-Context LLM Inference 10x Faster and 10x Cheaper ☆405 · Updated this week
- ☆294 · Updated 5 months ago
- NVIDIA NCCL Tests for Distributed Training ☆78 · Updated this week
- Self-hardening firewall for large language models ☆261 · Updated 11 months ago
- VS Code extension to convert computationally intensive PyTorch kernels to Triton ☆19 · Updated 3 months ago