Trainy-ai / konduktor
Cluster/scheduler health monitoring for GPU jobs on k8s
☆47 · Updated this week
Alternatives and similar repositories for konduktor:
Users interested in konduktor are comparing it to the libraries listed below.
- WebAssembly dev environment for Envoy Proxy. Iterate on your HTTP/TCP middleware in seconds! ☆54 · Updated last year
- Fine-tuning and serving LLMs on any cloud ☆87 · Updated last year
- Profiling tools for distributed training ☆38 · Updated last year
- Orchestrated process and container checkpointing ☆73 · Updated this week
- Visualize your GPU usage ☆16 · Updated last year
- Cedana: Access and run on compute anywhere in the world, on any provider. Migrate seamlessly between providers, arbitraging price/perform… ☆58 · Updated 9 months ago
- A simple DAG for executing LLM calls and using tools. ☆39 · Updated last year
- CUDA checkpoint and restore utility ☆274 · Updated this week
- Module, Model, and Tensor Serialization/Deserialization ☆210 · Updated 2 months ago
- Pretrain, finetune and serve LLMs on Intel platforms with Ray ☆110 · Updated 2 months ago
- Runner in charge of collecting metrics from LLM inference endpoints for the Unify Hub ☆17 · Updated 11 months ago
- Kubernetes Operator, ansible playbooks, and production scripts for large-scale AIStore deployments on Kubernetes. ☆88 · Updated last week
- ☆154 · Updated last week
- Cloud Native Benchmarking of Foundation Models ☆21 · Updated 2 months ago
- A landscape of the infrastructure that powers the generative AI ecosystem ☆135 · Updated 3 months ago
- Augment Swarm with durable execution to help you build reliable and scalable multi-agent systems. ☆88 · Updated 2 months ago
- ⚡️ A fast and flexible PyTorch inference server that runs locally, on any cloud or AI HW. ☆134 · Updated 7 months ago
- A simple service that integrates vLLM with Ray Serve for fast and scalable LLM serving. ☆60 · Updated 9 months ago
- A simple pure Python/PyTorch performance daemon for training workloads ☆15 · Updated last year
- ☆30 · Updated 2 years ago
- Pixeltable: AI data infrastructure providing a declarative, incremental approach for multimodal workloads. ☆144 · Updated this week
- Serverless LLM Serving for Everyone. ☆407 · Updated this week
- JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs wel… ☆268 · Updated this week
- ☆43 · Updated 7 months ago
- Making Long-Context LLM Inference 10x Faster and 10x Cheaper ☆405 · Updated this week
- ☆294 · Updated 5 months ago
- NVIDIA NCCL Tests for Distributed Training ☆78 · Updated this week
- Self-hardening firewall for large language models ☆261 · Updated 11 months ago
- VS Code extension to convert computationally intensive PyTorch kernels to Triton ☆19 · Updated 3 months ago