AutonomicPerfectionist / PipeInferLinks

PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation

☆31

Alternatives and similar repositories for PipeInfer

Users that are interested in PipeInfer are comparing it to the libraries listed below

Sorting:

hao-ai-lab / MuxServe
☆79Updated last month
tyler-griggs / melange-release
☆48Updated last year
PKU-SEC-Lab / HybriMoE
[DAC'25] Official implement of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference"
☆90Updated 5 months ago
flexflow / flexflow-serve
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
☆62Updated 2 months ago
DerrickYLJ / TidalDecode
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆49Updated 3 months ago
DD-DuDa / BitDecoding
[HPCA 2026] A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.
☆63Updated last week
YJHMITWEB / ExFlow
Explore Inter-layer Expert Affinity in MoE Model Inference
☆15Updated last year
microsoft / ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
☆196Updated last year
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆131Updated 11 months ago
project-etalon / etalon
LLM Serving Performance Evaluation Harness
☆80Updated 9 months ago
ruipeterpan / marconi
Artifact for "Marconi: Prefix Caching for the Era of Hybrid LLMs" [MLSys '25 Outstanding Paper Award, Honorable Mention]
☆46Updated 8 months ago
WukLab / preble
Stateful LLM Serving
☆89Updated 8 months ago
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆243Updated last year
abhibambhaniya / GenZ-LLM-Analyzer
LLM Inference analyzer for different hardware platforms
☆96Updated 4 months ago
d-matrix-ai / keyformer-llm
☆58Updated last year
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆170Updated last year
DeepLink-org / DLSlime
DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit
☆82Updated this week
INT-FlashAttention2024 / INT-FlashAttention
☆83Updated 10 months ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆108Updated 8 months ago
EfficientMoE / MoE-Infinity
PyTorch library for cost-effective, fast and easy serving of MoE models.
☆262Updated last month
LoongServe / LoongServe
☆123Updated last year
hao-ai-lab / vllm-ltr
[NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank
☆64Updated last year
InternLM / Awesome-LLM-Training-System
☆44Updated last year
Relaxed-System-Lab / HexGen
[ICML 2024] Serving LLMs on heterogeneous decentralized clusters.
☆31Updated last year
UChi-JCL / CacheGen
☆140Updated last year
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆59Updated 8 months ago
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆224Updated 2 years ago
ranggihwang / Pregated_MoE
☆57Updated last year
microsoft / RetrievalAttention
Scalable long-context LLM decoding that leverages sparsity—by treating the KV cache as a vector storage system.
☆99Updated 2 months ago
flashinfer-ai / cutlass-viz
☆65Updated 7 months ago