snowflakedb / ArcticInferenceLinks

ArcticInference: vLLM plugin for high-throughput, low-latency inference

☆203

Alternatives and similar repositories for ArcticInference

Users that are interested in ArcticInference are comparing it to the libraries listed below

Sorting:

snowflakedb / ArcticTraining
ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs)
☆190Updated this week
project-etalon / etalon
LLM Serving Performance Evaluation Harness
☆79Updated 5 months ago
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆134Updated 6 months ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆123Updated 8 months ago
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆385Updated last month
NVIDIA / kvpress
LLM KV cache compression made easy
☆566Updated this week
neuralmagic / nm-vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆265Updated 9 months ago
sgl-project / genai-bench
Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv…
☆180Updated this week
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆224Updated 8 months ago
neuralmagic / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆141Updated last week
microsoft / sarathi-serve
A low-latency & high-throughput serving engine for LLMs
☆400Updated 2 months ago
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆405Updated 2 months ago
hao-ai-lab / Dynasor
Simple extension on vLLM to help you speed up reasoning model without training.
☆172Updated 2 months ago
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆206Updated last week
Infini-AI-Lab / MagicPIG
[ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation
☆228Updated 7 months ago
EfficientMoE / MoE-Infinity
PyTorch library for cost-effective, fast and easy serving of MoE models.
☆215Updated 3 weeks ago
ppl-ai / pplx-kernels
Perplexity GPU Kernels
☆418Updated 2 weeks ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆80Updated 11 months ago
fw-ai / benchmark
Benchmark suite for LLMs from Fireworks.ai
☆76Updated this week
vllm-project / dashboard
vLLM performance dashboard
☆33Updated last year
MDK8888 / vllmini
A minimal implementation of vllm.
☆50Updated last year
foundation-model-stack / fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash…
☆258Updated last week
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆365Updated 11 months ago
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆338Updated last week
tyler-griggs / melange-release
☆47Updated last year
neuralmagic / AutoFP8
☆195Updated 2 months ago
IST-DASLab / MoE-Quant
Code for data-aware compression of DeepSeek models
☆40Updated last month
apple / ml-recurrent-drafter
☆215Updated 6 months ago
efeslab / Nanoflow
A throughput-oriented high-performance serving framework for LLMs
☆856Updated 3 weeks ago
vllm-project / recipes
Common recipes to run vLLM
☆68Updated this week