ROCm / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆57 Updated this week
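For context, a minimal sketch of what offline batch inference with vLLM typically looks like, using its `LLM` and `SamplingParams` API; the model name, prompts, and sampling parameters below are illustrative placeholders, not taken from this listing:

```python
# Minimal offline-inference sketch with vLLM (illustrative; model name and
# sampling parameters are placeholders).
from vllm import LLM, SamplingParams

prompts = [
    "Explain paged attention in one sentence.",
    "What does W4A8 quantization mean?",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load a model and run batched generation; vLLM handles KV-cache paging
# and continuous batching internally.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```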
Alternatives and similar repositories for vllm:
Users interested in vllm are comparing it to the libraries listed below.
- Development repository for the Triton language and compiler ☆104 Updated this week
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5) ☆232 Updated 3 months ago
- High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline ☆94 Updated 6 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆50 Updated this week
- Ongoing research training transformer models at scale ☆13 Updated this week
- Fast and memory-efficient exact attention ☆152 Updated this week
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆274 Updated this week
- OpenAI Triton backend for Intel® GPUs ☆157 Updated this week
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆292 Updated this week
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver… ☆219 Updated 2 weeks ago
- Applied AI experiments and examples for PyTorch ☆216 Updated last week
- LLM inference analyzer for different hardware platforms ☆47 Updated this week
- Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators ☆338 Updated this week
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆95 Updated last month
- A low-latency & high-throughput serving engine for LLMs ☆301 Updated 4 months ago
- Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores ☆54 Updated 4 months ago
- RCCL Performance Benchmark Tests ☆55 Updated 2 weeks ago
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving ☆492 Updated this week
- hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditiona… ☆73 Updated this week
- Collection of benchmarks to measure basic GPU capabilities ☆287 Updated 3 weeks ago