ROCm / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆35, updated this week
Related projects:
- Development repository for the Triton language and compiler (☆86, updated this week)
- Fast and memory-efficient exact attention (☆126, updated this week)
- A low-latency & high-throughput serving engine for LLMs (☆174, updated last week)
- Applied AI experiments and examples for PyTorch (☆123, updated last month)
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5) (☆173, updated 3 months ago)
- OpenAI Triton backend for Intel® GPUs (☆126, updated this week)
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving (☆399, updated 2 weeks ago)
- A high-throughput and memory-efficient inference and serving engine for LLMs (☆34, updated this week)
- FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens (☆562, updated 2 weeks ago)
- Dynamic Memory Management for Serving LLMs without PagedAttention (☆186, updated last month)
- ☆138, updated 2 months ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications (☆233, updated this week)
- An experimental CPU backend for Triton (☆36, updated last week)
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity (☆166, updated 11 months ago)
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference (☆106, updated 6 months ago)
- ☆102, updated 3 months ago
- High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline (☆81, updated 2 months ago)
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… (☆276, updated last week)
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving (☆258, updated 2 months ago)
- Experimental projects related to TensorRT (☆62, updated this week)
- Code for QuaRot, end-to-end 4-bit inference for large language models (☆256, updated last month)
- Latency and Memory Analysis of Transformer Models for Training and Inference (☆338, updated 3 months ago)
- ☆67, updated last week
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment (☆342, updated this week)
- Zero Bubble Pipeline Parallelism (☆254, updated 2 weeks ago)
- AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver… (☆188, updated 3 weeks ago)
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU (XPU) device. Note… (☆56, updated 3 weeks ago)
- ☆140, updated 4 months ago
- A fast communication-overlapping library for tensor parallelism on GPUs (☆184, updated this week)
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, with easy export to ONNX/ONNX Runtime (☆141, updated 3 weeks ago)