vllm-project / vllm-openvino
☆27 · Updated last week
Alternatives and similar repositories for vllm-openvino
Users interested in vllm-openvino are comparing it to the libraries listed below.
- Run Generative AI models with a simple C++/Python API using OpenVINO Runtime ☆388 · Updated this week
- ☆431 · Updated 2 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆208 · Updated 2 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆111 · Updated this week
- Fast and memory-efficient exact attention ☆104 · Updated this week
- OpenAI Triton backend for Intel® GPUs ☆222 · Updated this week
- ☆134 · Updated last week
- High performance Transformer implementation in C++. ☆143 · Updated 10 months ago
- OpenVINO Intel NPU Compiler ☆74 · Updated last week
- Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond ☆701 · Updated last week
- Fast OS-level support for GPU checkpoint and restore ☆260 · Updated 2 months ago
- High-speed and easy-to-use LLM serving framework for local deployment ☆137 · Updated 4 months ago
- Pretrain, finetune and serve LLMs on Intel platforms with Ray ☆130 · Updated 2 months ago
- LLM-Inference-Bench ☆56 · Updated 4 months ago
- 🤗 Optimum Intel: Accelerate inference with Intel optimization tools ☆515 · Updated last week
- A low-latency & high-throughput serving engine for LLMs ☆454 · Updated last month
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆64 · Updated 2 months ago
- An experimental CPU backend for Triton ☆165 · Updated last month
- ☆52 · Updated this week
- [DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference" ☆94 · Updated 6 months ago
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆345 · Updated this week
- DeepSeek-V3/R1 inference performance simulator ☆169 · Updated 8 months ago
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆118 · Updated 6 months ago
- PyTorch library for cost-effective, fast and easy serving of MoE models. ☆263 · Updated last month
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU (XPU) devices. Note… ☆63 · Updated 5 months ago
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆446 · Updated 6 months ago
- Development repository for the Triton language and compiler ☆137 · Updated this week
- Efficient and easy multi-instance LLM serving ☆515 · Updated 3 months ago
- KV cache store for distributed LLM inference ☆371 · Updated 3 weeks ago
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆246 · Updated last year