ROCm / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆64 · Updated this week
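For reference, a minimal offline-inference sketch using vllm's Python API (the checkpoint name is just an example; any model supported by your vLLM/ROCm build works):

```python
from vllm import LLM, SamplingParams

# Example checkpoint; substitute any Hugging Face model your build supports.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["Explain paged attention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```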
Alternatives and similar repositories for vllm:
Users interested in vllm are comparing it to the libraries listed below.
- Ahead of Time (AOT) Triton Math Library ☆52 · Updated this week
- High-speed GEMV kernels, delivering up to a 2.7x speedup over the PyTorch baseline. ☆97 · Updated 7 months ago
- ☆67 · Updated 3 months ago
- Development repository for the Triton language and compiler (see the kernel sketch after this list) ☆105 · Updated this week
- ☆69 · Updated last month
- ☆180 · Updated 7 months ago
- Fast and memory-efficient exact attention (see the usage sketch after this list) ☆154 · Updated this week
- ☆42 · Updated last month
- ☆34 · Updated this week
- ☆18 · Updated this week
- ☆81 · Updated 5 months ago
- OpenAI Triton backend for Intel® GPUs ☆165 · Updated this week
- Ongoing research training transformer models at scale ☆15 · Updated this week
- Applied AI experiments and examples for PyTorch ☆223 · Updated this week
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆38 · Updated 9 months ago
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆100 · Updated 2 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆234 · Updated 3 months ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆103 · Updated 5 months ago
- Shared Middle-Layer for Triton Compilation ☆224 · Updated this week
- Extensible collectives library in Triton ☆82 · Updated 4 months ago
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆17 · Updated this week
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆294 · Updated this week
- An experimental CPU backend for Triton ☆87 · Updated 3 weeks ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆86 · Updated this week
- ☆19 · Updated this week
- AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N… ☆11 · Updated 7 months ago
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆285 · Updated this week
- ☆140 · Updated 9 months ago
- Fast and memory-efficient exact attention ☆44 · Updated this week
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving ☆495 · Updated 3 weeks ago
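As referenced for the Triton entry above, a minimal kernel sketch (the standard vector-add example; the block size and tensor sizes are arbitrary choices, not anything prescribed by the listed repositories):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```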
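And a usage sketch for the flash-attention entry above, assuming the flash-attn Python package and its flash_attn_func interface (tensor shapes are illustrative):

```python
import torch
from flash_attn import flash_attn_func

# q, k, v: (batch, seqlen, nheads, headdim); fp16/bf16 tensors on the GPU.
q = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)  # same shape as q
```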