sgl-project / ome
Open Model Engine (OME) — Kubernetes operator for LLM serving, GPU scheduling, and model lifecycle management. Works with SGLang, vLLM, TensorRT-LLM, and Triton
☆356 · Updated this week
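As a rough illustration of the pattern an operator like OME implements: it reconciles a custom resource describing a desired model deployment into pods, GPU requests, and services. The manifest below is a hypothetical sketch only; the API group, kind, and field names are invented for illustration and are not OME's actual CRD schema.

```yaml
# Hypothetical custom resource for an LLM-serving operator.
# All names here are illustrative; consult the project's docs
# for the real CRD schema.
apiVersion: serving.example.io/v1alpha1
kind: InferenceService
metadata:
  name: llama3-8b
spec:
  model:
    name: meta-llama/Meta-Llama-3-8B-Instruct  # model to pull and serve
  runtime: sglang                              # backend engine (e.g. SGLang, vLLM)
  replicas: 2
  resources:
    limits:
      nvidia.com/gpu: 1                        # one GPU per replica
```

The operator watches resources of this kind and drives the cluster toward the declared state, handling model download, scheduling onto GPU nodes, and rollout of the serving engine.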
Alternatives and similar repositories for ome
Users interested in ome are comparing it to the repositories listed below.
- Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond ☆753 · Updated last week
- A workload for deploying LLM inference services on Kubernetes ☆156 · Updated last week
- Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv… ☆251 · Updated this week
- Offline optimization of your disaggregated Dynamo graph ☆151 · Updated this week
- GPUd automates monitoring, diagnostics, and issue identification for GPUs ☆469 · Updated this week
- A lightweight vLLM simulator for mocking out replicas ☆83 · Updated this week
- NVIDIA Inference Xfer Library (NIXL) ☆820 · Updated this week
- NVIDIA NCCL Tests for Distributed Training ☆133 · Updated last week
- Efficient and easy multi-instance LLM serving ☆521 · Updated 4 months ago
- KV cache store for distributed LLM inference ☆385 · Updated 2 months ago
- The driver for LMCache core to run in vLLM ☆58 · Updated 11 months ago
- Distributed KV cache scheduling & offloading libraries ☆98 · Updated this week
- ☸️ Easy, advanced inference platform for large language models on Kubernetes. 🌟 Star to support our work! ☆286 · Updated last week
- AIPerf is a comprehensive benchmarking tool that measures the performance of generative AI models served by your preferred inference solu… ☆90 · Updated this week
- Cloud Native Benchmarking of Foundation Models ☆44 · Updated 5 months ago
- Kubernetes enhancements for Network Topology Aware Gang Scheduling & Autoscaling ☆143 · Updated this week
- A high-performance and lightweight router for vLLM large-scale deployment ☆82 · Updated 3 weeks ago
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆375 · Updated this week
- LeaderWorkerSet: An API for deploying a group of pods as a unit of replication ☆652 · Updated this week
- Inference scheduler for llm-d ☆120 · Updated this week
- CUDA checkpoint and restore utility ☆403 · Updated 4 months ago
- Kubernetes-native AI platform for scalable model serving ☆168 · Updated this week
- A toolkit for discovering cluster network topology ☆90 · Updated this week
- Materials for learning SGLang ☆717 · Updated 2 weeks ago
- Perplexity GPU Kernels ☆553 · Updated 2 months ago
- Pretrain, finetune, and serve LLMs on Intel platforms with Ray ☆131 · Updated 3 months ago
- GLake: optimizing GPU memory management and IO transmission ☆496 · Updated 9 months ago
- torchcomms: a modern PyTorch communications API ☆320 · Updated last week
- Gateway API Inference Extension ☆563 · Updated last week