☆286 · Feb 25, 2026 (updated last week)
Alternatives and similar repositories for runai-model-streamer
Users interested in runai-model-streamer are comparing it to the libraries listed below.
- GPU environment and cluster management with LLM support ☆658 · May 16, 2024 (updated last year)
- KAI Scheduler is an open-source Kubernetes-native scheduler for AI workloads at large scale ☆1,160 (updated this week)
- Module, Model, and Tensor Serialization/Deserialization ☆289 · Feb 6, 2026 (updated last month)
- Open Model Engine (OME) — Kubernetes operator for LLM serving, GPU scheduling, and model lifecycle management. Works with SGLang, vLLM, T… ☆384 (updated this week)
- 💫 A lightweight p2p-based cache system for model distributions on Kubernetes. Reframing now to make it a unified cache system with POSI… ☆26 · Dec 6, 2024 (updated last year)
- Gateway API Inference Extension ☆597 (updated this week)
- (WIP) Parallel inference for black-forest-labs' FLUX model ☆19 · Nov 18, 2024 (updated last year)
- LeaderWorkerSet: An API for deploying a group of pods as a unit of replication ☆673 · Feb 26, 2026 (updated last week)
- Simplified Data Management and Sharing for Kubernetes ☆17 (updated this week)
- The main purpose of runtime copilot is to assist with node runtime management tasks such as configuring registries, upgrading versions, i… ☆12 · May 16, 2023 (updated 2 years ago)
- Model Express is a Rust-based component meant to be placed next to existing model inference systems to speed up their startup times and i… ☆31 · Feb 27, 2026 (updated last week)
- [WIP] Better (FP8) attention for Hopper ☆32 · Feb 24, 2025 (updated last year)
- OpenAI-compatible API for open-source LLMs ☆16 · Oct 30, 2023 (updated 2 years ago)
- Fast and memory-efficient exact attention ☆18 (updated this week)
- Container Object Storage Interface (COSI) provisioner responsible for interfacing with COSI drivers. NOTE: The content of this repo has bee… ☆33 · Nov 26, 2024 (updated last year)
- dstack is an open-source control plane for running development, training, and inference jobs on GPUs—across hyperscalers, neoclouds, or o… ☆2,055 (updated this week)
- Deploy ChatGLM on Modelz ☆16 · Mar 20, 2023 (updated 2 years ago)
- A datacenter-scale distributed inference serving framework ☆6,154 · Feb 28, 2026 (updated last week)
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆267 · Dec 4, 2025 (updated 3 months ago)
- A top-like tool for monitoring GPUs in a cluster ☆84 · Feb 14, 2024 (updated 2 years ago)
- Tooling for exact and MinHash deduplication of large-scale text datasets ☆72 · Feb 19, 2026 (updated 2 weeks ago)
- CUDA checkpoint and restore utility ☆424 · Sep 15, 2025 (updated 5 months ago)
- Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks". This rep… ☆60 · Oct 31, 2024 (updated last year)
- Kubernetes enhancements for network-topology-aware gang scheduling and autoscaling ☆166 (updated this week)
- NVIDIA DRA Driver for GPUs ☆579 (updated this week)
- ☆33 · Aug 9, 2024 (updated last year)
- A throughput-oriented, high-performance serving framework for LLMs ☆947 · Oct 29, 2025 (updated 4 months ago)
- IBM development fork of https://github.com/huggingface/text-generation-inference ☆63 · Sep 18, 2025 (updated 5 months ago)
- WG Serving ☆34 · Dec 15, 2025 (updated 2 months ago)
- vLLM's reference system for K8s-native cluster-wide deployment with community-driven performance optimization ☆2,187 · Feb 27, 2026 (updated last week)
- Quantized attention on GPU ☆44 · Nov 22, 2024 (updated last year)
- OpenAI-compatible API for the TensorRT-LLM Triton backend ☆220 · Aug 1, 2024 (updated last year)
- An operator for deployment and maintenance of NVIDIA NIMs and NeMo microservices in a Kubernetes environment ☆150 (updated this week)
- AI inference operator for Kubernetes. The easiest way to serve ML models in production. Supports VLMs, LLMs, embeddings, and speech-to-te… ☆1,158 · Feb 23, 2026 (updated last week)
- A workload for deploying LLM inference services on Kubernetes ☆179 (updated this week)
- Triton kernels for Flux ☆22 · Jul 7, 2025 (updated 7 months ago)
- Machine Learning Inference Graph Spec ☆21 · Jul 27, 2019 (updated 6 years ago)
- Layer-condensed KV cache with a 10× larger batch size, fewer parameters, and less computation. Dramatic speed-up with better task performance… ☆156 · Apr 7, 2025 (updated 11 months ago)
- Using short models to classify long texts ☆21 · Mar 8, 2023 (updated 2 years ago)