vllm-project / vllm-openvino
☆23 · Updated 2 months ago
Alternatives and similar repositories for vllm-openvino
Users interested in vllm-openvino are comparing it to the libraries listed below.
- Run Generative AI models with simple C++/Python API and using OpenVINO Runtime ☆364 · Updated last week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆107 · Updated this week
- OpenVINO Tokenizers extension ☆42 · Updated this week
- OpenAI Triton backend for Intel® GPUs ☆215 · Updated this week
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆199 · Updated 2 weeks ago
- Fast and memory-efficient exact attention ☆96 · Updated last week
- ☆430 · Updated last month
- OpenVINO Intel NPU Compiler ☆73 · Updated last week
- 🤗 Optimum Intel: Accelerate inference with Intel optimization tools ☆502 · Updated this week
- ☆139 · Updated 4 months ago
- The driver for LMCache core to run in vLLM ☆55 · Updated 8 months ago
- ☆91 · Updated this week
- High-speed and easy-to-use LLM serving framework for local deployment ☆130 · Updated 2 months ago
- Fast and memory-efficient exact attention ☆194 · Updated last week
- Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. ☆679 · Updated this week
- Large Language Model Text Generation Inference on Habana Gaudi ☆34 · Updated 7 months ago
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU (XPU) device. Note… ☆63 · Updated 4 months ago
- High performance Transformer implementation in C++. ☆138 · Updated 9 months ago
- A GPU-driven system framework for scalable AI applications ☆119 · Updated 8 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆83 · Updated this week
- LLM-Inference-Bench ☆56 · Updated 3 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆180 · Updated last week
- AI Tensor Engine for ROCm ☆292 · Updated this week
- Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond ☆469 · Updated this week
- PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation ☆30 · Updated 11 months ago
- Pretrain, finetune and serve LLMs on Intel platforms with Ray ☆132 · Updated last month
- ☆78 · Updated 11 months ago
- DeepSeek-V3/R1 inference performance simulator ☆170 · Updated 7 months ago
- [DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference" ☆81 · Updated 4 months ago
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆63 · Updated last month