Supercomputing-System-AI-Lab / MiLo
Code repo for efficient quantized MoE inference with a mixture of low-rank compensators
☆24 · Updated 5 months ago
Alternatives and similar repositories for MiLo
Users interested in MiLo are comparing it to the repositories listed below
- [ICLR 2025] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆47 · Updated last month
- [ACL 2025 main] FR-Spec: Frequency-Ranked Speculative Sampling ☆44 · Updated 2 months ago
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI'24) ☆154 · Updated last year
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆335 · Updated 2 months ago
- A GPU-optimized system for efficient long-context LLM decoding with a low-bit KV cache ☆59 · Updated 3 weeks ago
- [DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference" ☆71 · Updated 3 months ago
- 16-fold memory access reduction with nearly no loss ☆105 · Updated 6 months ago
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection (see the sketch after this list) ☆140 · Updated 7 months ago
- Official implementation for Yuan & Liu & Zhong et al., KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark o… ☆84 · Updated 7 months ago
- The Official Implementation of Ada-KV [NeurIPS 2025] ☆95 · Updated last week
- ☆71 · Updated last year
- ☆55 · Updated last year
- An experimentation platform for LLM inference optimisation ☆33 · Updated last year
- ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction (NeurIPS'24) ☆43 · Updated 9 months ago
- ☆52 · Updated last year
- ☆24 · Updated 6 months ago
- Explore Inter-layer Expert Affinity in MoE Model Inference ☆14 · Updated last year
- PiKV: KV Cache Management System for Mixture of Experts [Efficient ML System] ☆36 · Updated this week
- Code for the paper [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆142 · Updated 4 months ago
- ☆58 · Updated 9 months ago
- Curated collection of papers on MoE model inference ☆265 · Updated last week
- This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding co… ☆205 · Updated last month
- ☆14 · Updated last year
- ☆43 · Updated 4 months ago
- ☆28 · Updated 4 months ago
- ☆137 · Updated 2 months ago
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ☆113 · Updated 5 months ago
- ☆18 · Updated 6 months ago
- Since the emergence of ChatGPT in 2022, the acceleration of Large Language Models has become increasingly important. Here is a list of pap… ☆270 · Updated 6 months ago
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs ☆57 · Updated 6 months ago