vllm-project / vllm-openvino
☆27 · Updated last week
Alternatives and similar repositories for vllm-openvino
Users interested in vllm-openvino are comparing it to the libraries listed below.
- Run Generative AI models with a simple C++/Python API using OpenVINO Runtime ☆388 · Updated this week
- ☆431 · Updated 2 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆208 · Updated 2 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆111 · Updated this week
- Fast and memory-efficient exact attention ☆104 · Updated this week
- OpenAI Triton backend for Intel® GPUs ☆222 · Updated this week
- ☆134 · Updated last week
- High performance Transformer implementation in C++. ☆143 · Updated 10 months ago
- OpenVINO Intel NPU Compiler ☆74 · Updated last week
- Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond ☆701 · Updated last week
- Fast OS-level support for GPU checkpoint and restore ☆260 · Updated 2 months ago
- High-speed and easy-to-use LLM serving framework for local deployment ☆137 · Updated 4 months ago
- Pretrain, finetune and serve LLMs on Intel platforms with Ray ☆130 · Updated 2 months ago
- LLM-Inference-Bench ☆56 · Updated 4 months ago
- 🤗 Optimum Intel: Accelerate inference with Intel optimization tools ☆515 · Updated last week
- A low-latency & high-throughput serving engine for LLMs ☆454 · Updated last month
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆64 · Updated 2 months ago
- An experimental CPU backend for Triton ☆165 · Updated last month
- ☆52 · Updated this week
- [DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference" ☆94 · Updated 6 months ago
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆345 · Updated this week
- DeepSeek-V3/R1 inference performance simulator ☆169 · Updated 8 months ago
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆118 · Updated 6 months ago
- PyTorch library for cost-effective, fast and easy serving of MoE models. ☆263 · Updated last month
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU (XPU) devices. Note… ☆63 · Updated 5 months ago
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆446 · Updated 6 months ago
- Development repository for the Triton language and compiler ☆137 · Updated this week
- Efficient and easy multi-instance LLM serving ☆515 · Updated 3 months ago
- KV cache store for distributed LLM inference ☆371 · Updated 3 weeks ago
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆246 · Updated last year