run-ai / runai-model-streamer
☆281 · Updated last week (Feb 4, 2026)
Alternatives and similar repositories for runai-model-streamer
Users interested in runai-model-streamer also compare it to the libraries listed below.
- High-performance safetensors model loader ☆99 · Updated last month (Jan 13, 2026)
- KAI Scheduler is an open-source, Kubernetes-native scheduler for AI workloads at large scale ☆1,127 · Updated this week
- ☆215 · Updated this week
- Module, Model, and Tensor Serialization/Deserialization ☆287 · Updated last week (Feb 6, 2026)
- Open Model Engine (OME) — Kubernetes operator for LLM serving, GPU scheduling, and model lifecycle management. Works with SGLang, vLLM, T… ☆370 · Updated this week
- 💫 A lightweight p2p-based cache system for model distributions on Kubernetes. Reframing now to make it a unified cache system with POSI… ☆25 · Updated last year (Dec 6, 2024)
- Model Express is a Rust-based component meant to be placed next to existing model inference systems to speed up their startup times and i… ☆25 · Updated last week (Feb 6, 2026)
- LeaderWorkerSet: An API for deploying a group of pods as a unit of replication ☆662 · Updated last week (Feb 2, 2026)
- (WIP) Parallel inference for black-forest-labs' FLUX model. ☆18 · Updated last year (Nov 18, 2024)
- Simplified Data Management and Sharing for Kubernetes ☆17 · Updated last week (Feb 6, 2026)
- The main purpose of runtime copilot is to assist with node runtime management tasks such as configuring registries, upgrading versions, i… ☆12 · Updated 2 years ago (May 16, 2023)
- Custom Scheduler to deploy ML models to TRTIS for GPU Sharing ☆11 · Updated 5 years ago (Apr 1, 2020)
- [WIP] Better (FP8) attention for Hopper ☆32 · Updated 11 months ago (Feb 24, 2025)
- Fast and memory-efficient exact attention ☆18 · Updated 3 weeks ago (Jan 23, 2026)
- OpenAI compatible API for open source LLMs ☆16 · Updated 2 years ago (Oct 30, 2023)
- Container Object Storage Interface (COSI) provisioner responsible for interfacing with COSI drivers. NOTE: The content of this repo has bee… ☆33 · Updated last year (Nov 26, 2024)
- Tooling for exact and MinHash deduplication of large-scale text datasets ☆68 · Updated last week (Feb 4, 2026)
- Deploy ChatGLM on Modelz ☆16 · Updated 2 years ago (Mar 20, 2023)
- A Datacenter Scale Distributed Inference Serving Framework ☆6,052 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆267 · Updated 2 months ago (Dec 4, 2025)
- A top-like tool for monitoring GPUs in a cluster ☆84 · Updated 2 years ago (Feb 14, 2024)
- Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks". This rep… ☆60 · Updated last year (Oct 31, 2024)
- NVIDIA DRA Driver for GPUs ☆557 · Updated last week (Feb 6, 2026)
- ☆33 · Updated last year (Aug 9, 2024)
- A throughput-oriented high-performance serving framework for LLMs ☆945 · Updated 3 months ago (Oct 29, 2025)
- IBM development fork of https://github.com/huggingface/text-generation-inference ☆63 · Updated 4 months ago (Sep 18, 2025)
- WG Serving ☆34 · Updated last month (Dec 15, 2025)
- vLLM’s reference system for K8S-native cluster-wide deployment with community-driven performance optimization ☆2,156 · Updated last week (Feb 6, 2026)
- Quantized Attention on GPU ☆44 · Updated last year (Nov 22, 2024)
- OpenAI compatible API for TensorRT LLM triton backend ☆220 · Updated last year (Aug 1, 2024)
- An Operator for deployment and maintenance of NVIDIA NIMs and NeMo microservices in a Kubernetes environment. ☆146 · Updated this week
- PyTorch implementation of models from the Zamba2 series.