vllm-project / vllm-openvino
☆25 · Updated 3 months ago
Alternatives and similar repositories for vllm-openvino
Users interested in vllm-openvino are comparing it to the libraries listed below.
- Run Generative AI models with simple C++/Python API and using OpenVINO Runtime (a minimal Python usage sketch follows at the end of this list) ☆371 · Updated this week
- ☆431 · Updated 2 months ago
- OpenAI Triton backend for Intel® GPUs ☆218 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆108 · Updated this week
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU (XPU) device. Note… ☆63 · Updated 4 months ago
- High-speed and easy-to-use LLM serving framework for local deployment ☆132 · Updated 3 months ago
- Fast and memory-efficient exact attention ☆97 · Updated last week
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆204 · Updated last month
- OpenVINO Tokenizers extension ☆42 · Updated last week
- ☆126 · Updated this week
- OpenVINO Intel NPU Compiler ☆73 · Updated last week
- 🤗 Optimum Intel: Accelerate inference with Intel optimization tools ☆507 · Updated this week
- ☆168 · Updated this week
- AMD related optimizations for transformer models ☆95 · Updated last month
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆63 · Updated 2 months ago
- An innovative library for efficient LLM inference via low-bit quantization ☆349 · Updated last year
- ☆33 · Updated 9 months ago
- FlagTree is a unified compiler for multiple AI chips, forked from triton-lang/triton. ☆131 · Updated this week
- An experimental CPU backend for Triton ☆160 · Updated last week
- Repo for SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting (ISCA25) ☆67 · Updated 6 months ago
- LLM Inference via Triton (Flexible & Modular): Focused on Kernel Optimization using CUBIN binaries, Starting from gpt-oss Model ☆56 · Updated last month
- Advanced quantization toolkit for LLMs. Native support for WOQ, MXFP4, NVFP4, GGUF, Adaptive Bits and seamless integration with Transform… ☆712 · Updated this week
- The driver for LMCache core to run in vLLM ☆56 · Updated 9 months ago
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆114 · Updated 6 months ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆119 · Updated last year
- Pretrain, finetune and serve LLMs on Intel platforms with Ray ☆131 · Updated last month
- ☆48 · Updated last year
- This repository contains Dockerfiles, scripts, yaml files, Helm charts, etc. used to scale out AI containers with versions of TensorFlow … ☆52 · Updated last week
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver… ☆268 · Updated 3 months ago
- KV cache store for distributed LLM inference ☆361 · Updated this week
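
The first entry above (OpenVINO GenAI) advertises a simple C++/Python API on top of OpenVINO Runtime. As a rough illustration, the sketch below assumes `openvino-genai` is installed and that `./ov_model_dir` is a hypothetical directory holding a model already exported to OpenVINO IR (for example via `optimum-cli export openvino`); it is a minimal sketch under those assumptions, not the project's canonical example.

```python
# Minimal sketch: run a text-generation model with OpenVINO GenAI.
# Assumptions: openvino-genai is installed, and ./ov_model_dir is a
# hypothetical path containing a model already exported to OpenVINO IR.
import openvino_genai as ov_genai

# Create an LLM pipeline on CPU ("GPU" / "NPU" are other device strings).
pipe = ov_genai.LLMPipeline("./ov_model_dir", "CPU")

# Generate a completion with a small token budget.
print(pipe.generate("What does vLLM's OpenVINO backend do?", max_new_tokens=64))
```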