NoakLiu / PiKV
PiKV: KV Cache Management System for Mixture of Experts [Efficient ML System]
☆48 · Updated 2 months ago
Alternatives and similar repositories for PiKV
Users interested in PiKV are comparing it to the libraries listed below.
- [DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference" ☆96 · Updated 3 weeks ago
- Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding ☆74 · Updated last month
- [ICLR 2025] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆50 · Updated 5 months ago
- ☆52 · Updated 7 months ago
- Artifact for "Marconi: Prefix Caching for the Era of Hybrid LLMs" [MLSys '25 Outstanding Paper Award, Honorable Mention] ☆48 · Updated 10 months ago
- PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design (KDD 2025) ☆30 · Updated last year
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆136 · Updated last year
- ☆22 · Updated 10 months ago
- ☆48 · Updated last year
- ☆81 · Updated 2 months ago
- Efficient Long-context Language Model Training by Core Attention Disaggregation ☆75 · Updated 2 weeks ago
- Distributed MoE in a Single Kernel [NeurIPS '25] ☆174 · Updated this week
- Efficient Compute-Communication Overlap for Distributed LLM Inference ☆67 · Updated 2 months ago
- [NeurIPS '25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning ☆83 · Updated last month
- Code for paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆160 · Updated 2 months ago
- Compression for Foundation Models ☆35 · Updated 5 months ago
- Scalable long-context LLM decoding that leverages sparsity by treating the KV cache as a vector storage system. ☆110 · Updated last week
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆175 · Updated last year
- DeeperGEMM: crazy optimized version ☆74 · Updated 8 months ago
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆206 · Updated last year
- 16-fold memory access reduction with nearly no loss ☆109 · Updated 9 months ago
- [HPCA 2026] A GPU-optimized system for efficient long-context LLM decoding with low-bit KV cache ☆76 · Updated 3 weeks ago
- Quantized Attention on GPU ☆44 · Updated last year
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank ☆67 · Updated last year
- ☆116 · Updated 7 months ago
- High-performance distributed data shuffling (all-to-all) library for MoE training and inference ☆102 · Updated last week
- Code repo for efficient quantized MoE inference with mixture of low-rank compensators ☆28 · Updated 8 months ago
- PyTorch implementation of our ICML 2024 paper "CaM: Cache Merging for Memory-efficient LLMs Inference" ☆47 · Updated last year
- [ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference ☆55 · Updated last year
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs ☆123 · Updated 6 months ago