HabanaAI / vllm-fork

A high-throughput and memory-efficient inference and serving engine for LLMs

☆78

Alternatives and similar repositories for vllm-fork

Users who are interested in vllm-fork compare it to the libraries listed below.

microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆405 Updated 2 months ago
microsoft / sarathi-serve
A low-latency & high-throughput serving engine for LLMs
☆400 Updated 2 months ago
NVIDIA / nvidia-resiliency-ext
NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the …
☆196 Updated this week
cli99 / llm-analysis
Latency and Memory Analysis of Transformer Models for Training and Inference
☆441 Updated 3 months ago
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆216 Updated last year
intel / intel-extension-for-deepspeed
Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU(XPU) device. Note…
☆61 Updated last month
triton-inference-server / perf_analyzer
☆98 Updated this week
run-ai / llmperf
☆58 Updated 10 months ago
pytorch-labs / applied-ai
Applied AI experiments and examples for PyTorch
☆289 Updated 2 months ago
snowflakedb / ArcticInference
ArcticInference: vLLM plugin for high-throughput, low-latency inference
☆203 Updated this week
ppl-ai / pplx-kernels
Perplexity GPU Kernels
☆418 Updated 2 weeks ago
huggingface / tgi-gaudi
Large Language Model Text Generation Inference on Habana Gaudi
☆34 Updated 4 months ago
ai-dynamo / nixl
NVIDIA Inference Xfer Library (NIXL)
☆502 Updated this week
sail-sg / zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
☆411 Updated 2 months ago
AlibabaPAI / llumnix
Efficient and easy multi-instance LLM serving
☆454 Updated this week
microsoft / mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
☆394 Updated this week
HabanaAI / gaudi-pytorch-bridge
☆17 Updated last week
usyd-fsalab / fp6_llm
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆260 Updated 2 weeks ago
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆206 Updated last week
ColfaxResearch / cutlass-kernels
☆227 Updated last year
yifuwang / symm-mem-recipes
☆102 Updated 7 months ago
anyscale / llm-continuous-batching-benchmarks
☆120 Updated last year
SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆118 Updated last year
huggingface / optimum-benchmark
🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O…
☆307 Updated 2 months ago
neuralmagic / AutoFP8
☆195 Updated 2 months ago
facebookresearch / HolisticTraceAnalysis
A library to analyze PyTorch traces.
☆400 Updated this week
NetEase-FuXi / EETQ
Easy and Efficient Quantization for Transformers
☆198 Updated last month
sgl-project / genai-bench
Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv…
☆180 Updated this week
efeslab / Nanoflow
A throughput-oriented high-performance serving framework for LLMs
☆856 Updated 3 weeks ago
NVIDIA / nvbandwidth
A tool for bandwidth measurements on NVIDIA GPUs.
☆496 Updated 3 months ago