Supercomputing-System-AI-Lab / MiLo

Code repo for efficient quantized MoE inference with mixture of low-rank compensators

☆25

Alternatives and similar repositories for MiLo

Users who are interested in MiLo are comparing it to the libraries listed below.

DerrickYLJ / TidalDecode
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆48 Updated 2 months ago
DD-DuDa / BitDecoding
A GPU-optimized system for efficient long-context LLM decoding with low-bit KV cache.
☆60 Updated last week
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆105 Updated 6 months ago
thunlp / FR-Spec
[ACL 2025 main] FR-Spec: Frequency-Ranked Speculative Sampling
☆45 Updated 3 months ago
hao-ai-lab / MuxServe
☆72 Updated last year
FFY0 / AdaKV
The Official Implementation of Ada-KV [NeurIPS 2025]
☆105 Updated 3 weeks ago
snu-comparch / InfiniGen
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI'24)
☆155 Updated last year
shadowpa0327 / Palu
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
☆142 Updated 8 months ago
mit-han-lab / Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆338 Updated 3 months ago
Adaxry / Unified_Layer_Skipping
☆14 Updated last year
NoakLiu / PiKV
PiKV: KV Cache Management System for Mixture of Experts [Efficient ML System]
☆40 Updated this week
smart-lty / ParallelSpeculativeDecoding
[ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length
☆118 Updated 6 months ago
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆59 Updated 6 months ago
d-matrix-ai / keyformer-llm
☆59 Updated last year
INT-FlashAttention2024 / INT-FlashAttention
☆82 Updated 8 months ago
amazon-science / piperag
PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design (KDD 2025)
☆26 Updated last year
PKU-SEC-Lab / HybriMoE
[DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference"
☆72 Updated 4 months ago
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
☆169 Updated last year
TreeAI-Lab / Awesome-KV-Cache-Management
This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding co…
☆219 Updated 2 months ago
zcli-charlie / Awesome-KV-Cache
☆79 Updated last year
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆221 Updated 2 years ago
HarryWu99 / llm_kvcache_sparsity
Implementations of some methods for LLM KV cache sparsity
☆39 Updated last year
YaoJiayi / CacheBlend
☆140 Updated 3 months ago
henryzhongsc / longctx_bench
Official implementation for Yuan & Liu & Zhong et al., KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark o…
☆86 Updated 7 months ago
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆143 Updated this week
cat538 / SKVQ
[COLM 2024] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models
☆24 Updated last year
PanZaifeng / FastTree-Artifact
☆25 Updated 6 months ago
ranggihwang / Pregated_MoE
☆55 Updated last year
SqueezeAILab / SqueezedAttention
[ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference
☆54 Updated 10 months ago
ASISys / AdaSkip
AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference
☆15 Updated 8 months ago