vllm-project / vllm-openvino (☆25)
Alternatives and similar repositories for vllm-openvino
Users interested in vllm-openvino compare it to the libraries listed below:
- Run Generative AI models with a simple C++/Python API using the OpenVINO Runtime (☆342)
- A high-throughput and memory-efficient inference and serving engine for LLMs (☆102)
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… (☆194)
- OpenAI Triton backend for Intel® GPUs (☆208)
- OpenVINO Tokenizers extension (☆40)
- This repository contains Dockerfiles, scripts, yaml files, Helm charts, etc. used to scale out AI containers with versions of TensorFlow … (☆52)
- Fast and memory-efficient exact attention (☆94)
- AMD-related optimizations for transformer models (☆88)
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU (XPU) devices. Note… (☆62)
- OpenVINO Intel NPU Compiler (☆67)
- 🤗 Optimum Intel: Accelerate inference with Intel optimization tools (☆494)
- Fast and memory-efficient exact attention (☆189)
- [DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference" (☆71)
- AI Tensor Engine for ROCm (☆279)
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving (☆58)
- The driver for LMCache core to run in vLLM (☆51)
- High-performance Transformer implementation in C++ (☆134)
- High-speed, easy-to-use LLM serving framework for local deployment (☆119)
- An experimental CPU backend for Triton (☆153)
- An innovative library for efficient LLM inference via low-bit quantization (☆348)
- This repository is a read-only mirror of https://gitlab.arm.com/kleidi/kleidiai (☆81)
- A high-throughput and memory-efficient inference and serving engine for LLMs (☆83)
- FlagTree is a unified compiler for multiple AI chips, forked from triton-lang/triton (☆86)
- Advanced quantization algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA, and HPU (☆638)