Snowflake-Labs / vllm
☆15 · Updated 4 months ago
Alternatives and similar repositories for vllm
Users interested in vllm are comparing it to the libraries listed below.
- Benchmark suite for LLMs from Fireworks.ai ☆76 · Updated this week
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) ☆190 · Updated this week
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆42 · Updated last year
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … ☆61 · Updated 9 months ago
- IBM development fork of https://github.com/huggingface/text-generation-inference ☆61 · Updated 2 months ago
- A collection of reproducible inference engine benchmarks ☆32 · Updated 3 months ago
- ☆45 · Updated last year
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆203 · Updated this week
- Google TPU optimizations for transformers models ☆117 · Updated 6 months ago
- The backend behind the LLM-Perf Leaderboard ☆10 · Updated last year
- Matrix (Multi-Agent daTa geneRation Infra and eXperimentation framework) is a versatile engine for multi-agent conversational data genera… ☆78 · Updated last week
- Cray-LM unified training and inference stack ☆22 · Updated 6 months ago
- The code for the paper "ROUTERBENCH: A Benchmark for Multi-LLM Routing System" ☆131 · Updated last year
- ☆31 · Updated 8 months ago
- vLLM adapter for a TGIS-compatible gRPC server ☆33 · Updated this week
- Easy and Efficient Quantization for Transformers ☆198 · Updated last month
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆265 · Updated 9 months ago
- Code repository for the paper "AdANNS: A Framework for Adaptive Semantic Search" ☆65 · Updated last year
- A Lossless Compression Library for AI pipelines ☆272 · Updated last month
- Training-free post-training efficient sub-quadratic-complexity attention, implemented with OpenAI Triton ☆141 · Updated this week
- ☆76 · Updated last month
- Experiments with inference on Llama ☆104 · Updated last year
- Code for the NeurIPS LLM Efficiency Challenge ☆59 · Updated last year
- Train, tune, and infer the Bamba model ☆130 · Updated 2 months ago
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆123 · Updated 8 months ago
- XTR/WARP (SIGIR '25) is an extremely fast and accurate retrieval engine based on Stanford's ColBERTv2/PLAID and Google DeepMind's XTR ☆152 · Updated 3 months ago
- A Python wrapper around HuggingFace's TGI (text-generation-inference) and TEI (text-embedding-inference) servers ☆33 · Updated 2 months ago
- 🤝 Trade any tensors over the network ☆30 · Updated last year
- This is a fork of SGLang for hip-attention integration; please refer to hip-attention for details ☆15 · Updated this week
- ☆48 · Updated 11 months ago