microsoft / MInferenceLinks

[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention, which reduces inference latency by up to 10x for pre-filling on an A100 while maintaining accuracy.

☆1,143

Alternatives and similar repositories for MInference

Users that are interested in MInference are comparing it to the libraries listed below

Sorting:

efeslab / Nanoflow
A throughput-oriented high-performance serving framework for LLMs
☆909Updated last week
hao-ai-lab / LookaheadDecoding
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
☆1,291Updated 7 months ago
jzhang38 / EasyContext
Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware.
☆750Updated last year
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆770Updated 7 months ago
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆922Updated last year
allenai / OLMoE
OLMoE: Open Mixture-of-Experts Language Models
☆888Updated last month
mit-han-lab / duo-attention
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
☆496Updated 8 months ago
mlc-ai / xgrammar
Fast, Flexible and Portable Structured Generation
☆1,323Updated last week
SafeAILab / EAGLE
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25).
☆2,008Updated 2 weeks ago
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆390Updated 4 months ago
vectorch-ai / ScaleLLM
A high-performance inference system for large language models, designed for production environments.
☆479Updated 3 weeks ago
OpenGVLab / OmniQuant
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
☆860Updated 5 months ago
microsoft / TransformerCompression
For releasing code related to compression methods for transformers, accompanying our publications
☆447Updated 9 months ago
microsoft / VPTQ
VPTQ, A Flexible and Extreme low-bit quantization algorithm
☆659Updated 6 months ago
punica-ai / punica
Serving multiple LoRA finetuned LLM as one
☆1,106Updated last year
zhuzilin / ring-flash-attention
Ring attention implementation with flash attention
☆903Updated last month
ModelCloud / GPTQModel
LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi…
☆852Updated this week
thunlp / InfLLM
The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Mem…
☆385Updated last year
vllm-project / llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
☆2,149Updated this week
NVlabs / Minitron
A family of compressed models obtained via pruning and knowledge distillation
☆354Updated 11 months ago
THUDM / LongBench
LongBench v2 and LongBench (ACL 25'&24')
☆1,005Updated 9 months ago
arcee-ai / DistillKit
An Open Source Toolkit For LLM Distillation
☆744Updated 3 months ago
feifeibear / LLMSpeculativeSampling
Fast inference from large lauguage models via speculative decoding
☆841Updated last year
MoonshotAI / Moonlight
Muon is Scalable for LLM Training
☆1,342Updated 2 months ago
NVIDIA / NeMo-Aligner
Scalable toolkit for efficient model alignment
☆841Updated 3 weeks ago
princeton-nlp / LLM-Shearing
[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
☆631Updated last year
hahnyuan / LLM-Viewer
Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod…
☆568Updated last year
intel / auto-round
Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU.
☆679Updated this week
pjlab-sys4nlp / llama-moe
⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024)
☆994Updated 10 months ago
huggingface / nanotron
Minimalistic large language model 3D-parallelism training
☆2,274Updated last month