neuralmagic/nm-vllm

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/neuralmagic/nm-vllm)

neuralmagic / nm-vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

☆267

Alternatives and similar repositories for nm-vllm

Users that are interested in nm-vllm are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

IST-DASLab / Sparse-Marlin
View on GitHub
Boosting 4-bit inference kernels with 2:4 Sparsity
☆96Sep 4, 2024Updated last year
neuralmagic / AutoFP8
View on GitHub
☆210May 5, 2025Updated last year
IST-DASLab / marlin
View on GitHub
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆1,109Sep 4, 2024Updated last year
vllm-project / compressed-tensors
View on GitHub
A safetensors extension to efficiently store sparse quantized tensors on disk
☆303Updated this week
AlibabaPAI / FLASHNN
View on GitHub
☆106Sep 9, 2024Updated last year
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
neuralmagic / sparsify
View on GitHub
ML model optimization product to accelerate inference.
☆325Jun 2, 2025Updated last year
efeslab / Nanoflow
View on GitHub
A throughput-oriented high-performance serving framework for LLMs
☆968Mar 29, 2026Updated 3 months ago
IsaacRe / vllm-kvcompress
View on GitHub
KV cache compression for high-throughput LLM inference
☆158Feb 5, 2025Updated last year
AniZpZ / AutoSmoothQuant
View on GitHub
An easy-to-use package for implementing SmoothQuant for LLMs
☆111Apr 7, 2025Updated last year
neuralmagic / deepsparse
View on GitHub
Sparsity-aware deep learning inference runtime for CPUs
☆3,160Jun 2, 2025Updated last year
feifeibear / ChituAttention
View on GitHub
Quantized Attention on GPU
☆45Nov 22, 2024Updated last year
vllm-project / llm-compressor
View on GitHub
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
☆3,566Updated this week
Infini-AI-Lab / TriForce
View on GitHub
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
☆281Aug 31, 2024Updated last year
SqueezeBits / QUICK
View on GitHub
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆123Mar 6, 2024Updated 2 years ago
Virtual machines for every use case on DigitalOcean • Ad
Get dependable uptime with 99.99% SLA, simple security tools, and predictable monthly pricing with DigitalOcean's virtual machines, called Droplets.
NetEase-FuXi / EETQ
View on GitHub
Easy and Efficient Quantization for Transformers
☆205Mar 25, 2026Updated 3 months ago
neuralmagic / sparseml
View on GitHub
Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
☆2,145Jun 2, 2025Updated last year
neuralmagic / sparsezoo
View on GitHub
Neural network model repository for highly sparse and sparse-quantized models with matching sparsification recipes
☆388Jun 2, 2025Updated last year
hao-ai-lab / LookaheadDecoding
View on GitHub
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
☆1,340Mar 6, 2025Updated last year
HandH1998 / QQQ
View on GitHub
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆157Aug 21, 2025Updated 11 months ago
punica-ai / punica
View on GitHub
Serving multiple LoRA finetuned LLM as one
☆1,166May 8, 2024Updated 2 years ago
Infini-AI-Lab / MagicDec
View on GitHub
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆154Dec 4, 2024Updated last year
metterian / korean_bert_score
View on GitHub
BERT score for text generation
☆12Jan 15, 2025Updated last year
IST-DASLab / EvoPress
View on GitHub
☆43Jun 14, 2026Updated last month
Managed Database hosting by DigitalOcean • Ad
PostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
russellb / canhazgpu
View on GitHub
A simple GPU reservation tool for single host shared development systems
☆29Jul 6, 2026Updated 2 weeks ago
HanGuo97 / flute
View on GitHub
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
☆391Apr 13, 2025Updated last year
dropbox / hqq
View on GitHub
Official implementation of Half-Quadratic Quantization (HQQ)
☆949Feb 26, 2026Updated 4 months ago
IST-DASLab / qmoe
View on GitHub
Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".
☆278Nov 3, 2023Updated 2 years ago
luyug / magix
View on GitHub
Supercharge huggingface transformers with model parallelism.
☆77Jul 23, 2025Updated 11 months ago
neuralmagic / docs
View on GitHub
Top-level directory for documentation and general content
☆120Jun 2, 2025Updated last year
flashinfer-ai / flashinfer
View on GitHub
FlashInfer: Kernel Library for LLM Serving
☆5,994Updated this week
nyunAI / PruneGPT
View on GitHub
☆50May 31, 2024Updated 2 years ago
efeslab / Atom
View on GitHub
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆344Jul 2, 2024Updated 2 years ago
1-Click AI Models by DigitalOcean Gradient • Ad
Deploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click. Zero configuration with optimized deployments.
hyunwoongko / beyond-lm
View on GitHub
Beyond LM: How can language model go forward in the future?
☆15Apr 30, 2023Updated 3 years ago
vllm-project / speculators
View on GitHub
A unified library for building, evaluating, and storing speculative decoding algorithms for LLM inference in vLLM
☆633Updated this week
KyujinHan / Sakura-SOLAR-DPO
View on GitHub
Sakura-SOLAR-DPO: Merge, SFT, and DPO
☆116Dec 30, 2023Updated 2 years ago
premAI-io / benchmarks
View on GitHub
🕹️ Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models.
☆141Jul 25, 2024Updated last year
opengear-project / GEAR
View on GitHub
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆183Jul 12, 2024Updated 2 years ago
SpRegTiling / sparse-register-tiling
View on GitHub
☆10Mar 2, 2024Updated 2 years ago
usyd-fsalab / fp6_llm
View on GitHub
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆282Jul 16, 2025Updated last year