ROCm / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆85 · Updated this week
Alternatives and similar repositories for vllm
Users interested in vllm are comparing it to the libraries listed below.
- Development repository for the Triton language and compiler · ☆125 · Updated this week
- Fast and memory-efficient exact attention · ☆177 · Updated this week
- ☆40 · Updated this week
- AI Tensor Engine for ROCm · ☆232 · Updated this week
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5) · ☆255 · Updated 8 months ago
- OpenAI Triton backend for Intel® GPUs · ☆193 · Updated this week
- ☆100 · Updated 6 months ago
- ☆74 · Updated 3 months ago
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU (XPU) devices. Note… · ☆61 · Updated 2 weeks ago
- Fast and memory-efficient exact attention · ☆81 · Updated last week
- ☆83 · Updated 8 months ago
- High-speed GEMV kernels with up to 2.7x speedup over the PyTorch baseline · ☆112 · Updated last year
- Ongoing research training transformer models at scale · ☆25 · Updated last week
- LLaMA INT4 CUDA inference with AWQ · ☆54 · Updated 5 months ago
- ☆21 · Updated last week
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs · ☆135 · Updated 3 months ago
- A lightweight design for computation-communication overlap · ☆148 · Updated 3 weeks ago
- Ahead-of-Time (AOT) Triton Math Library · ☆70 · Updated last week
- An experimental CPU backend for Triton · ☆135 · Updated last month
- Standalone Flash Attention v2 kernel without the libtorch dependency · ☆111 · Updated 10 months ago
- ☆195 · Updated 2 months ago
- ☆31 · Updated 5 months ago
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment · ☆651 · Updated last week
- AI Accelerator Benchmark focuses on evaluating AI accelerators from a practical production perspective, including the ease of use and ver… · ☆251 · Updated 3 weeks ago
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity · ☆216 · Updated last year
- ☆96 · Updated 10 months ago
- ☆216 · Updated last year
- Composable Kernel: Performance-Portable Programming Model for Machine Learning Tensor Operators · ☆437 · Updated this week
- ☆162 · Updated last week
- RCCL Performance Benchmark Tests · ☆70 · Updated this week