EmbeddedLLM / vllmLinks

vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs

☆90

Alternatives and similar repositories for vllm

Users that are interested in vllm are comparing it to the libraries listed below

Sorting:

IST-DASLab / qmoe
Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
☆277Updated last year
neuralmagic / nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆266Updated last year
mlc-ai / llm-perf-bench
☆120Updated last year
rafacelente / bllama
1.58-bit LLaMa model
☆83Updated last year
GreenBitAI / low_bit_llama
Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs
☆110Updated last year
apple / ml-recurrent-drafter
☆218Updated 9 months ago
NetEase-FuXi / EETQ
Easy and Efficient Quantization for Transformers
☆202Updated 4 months ago
Cornell-RelaxML / QuIP
Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"
☆385Updated last year
hahnyuan / PB-LLM
PB-LLM: Partially Binarized Large Language Models
☆156Updated last year
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆238Updated 11 months ago
fw-ai / benchmark
Benchmark suite for LLMs from Fireworks.ai
☆83Updated 2 weeks ago
intel / neural-speed
An innovative library for efficient LLM inference via low-bit quantization
☆349Updated last year
huggingface / optimum-benchmark
🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O…
☆318Updated last month
mani-kantap / llm-inference-solutions
A collection of all available inference solutions for the LLMs
☆91Updated 7 months ago
astramind-ai / BitMat
An efficent implementation of the method proposed in "The Era of 1-bit LLMs"
☆154Updated last year
vllm-project / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆183Updated this week
chu-tianxiang / QuIP-for-all
QuIP quantization
☆59Updated last year
dust-tt / llama-ssp
Experiments on speculative sampling with Llama models
☆125Updated 2 years ago
chu-tianxiang / llama-cpp-torch
llama.cpp to PyTorch Converter
☆34Updated last year
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆84Updated last year
LLM360 / amber-data-prep
Data preparation code for Amber 7B LLM
☆93Updated last year
IST-DASLab / Quartet
☆103Updated this week
wejoncy / QLLM
A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily.
☆180Updated 6 months ago
VITA-Group / Q-GaLore
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.
☆202Updated last year
Infini-AI-Lab / Sequoia
scalable and robust tree-based speculative decoding algorithm
☆361Updated 9 months ago
Cornell-RelaxML / qtip
☆152Updated 4 months ago
amazon-science / mxfp4-llm
Official implementation for Training LLMs with MXFP4
☆100Updated 6 months ago
AlpinDale / QuIP-for-Llama
Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" adapted for Llama models
☆40Updated 2 years ago
LambdaLabsML / llama
Inference code for LLaMA models
☆42Updated 2 years ago
intel / llm-on-ray
Pretrain, finetune and serve LLMs on Intel platforms with Ray
☆132Updated last month