Leosang-lx / FlowSpec
Continuous Pipelined Speculative Decoding
☆17 · Updated last week
Alternatives and similar repositories for FlowSpec
Users interested in FlowSpec are comparing it to the libraries listed below.
- ☆29 · Updated 7 months ago
- AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference ☆20 · Updated 11 months ago
- Code repo for efficient quantized MoE inference with mixture of low-rank compensators ☆28 · Updated 8 months ago
- [ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆50 · Updated 5 months ago
- Official Implementation of SAM-Decoding: Speculative Decoding via Suffix Automaton ☆39 · Updated 10 months ago
- ☆36 · Updated 10 months ago
- [ACL 2025 main] FR-Spec: Frequency-Ranked Speculative Sampling ☆49 · Updated 5 months ago
- Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding ☆74 · Updated last month
- Source code of paper "KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing" ☆31 · Updated last year
- [DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference" ☆96 · Updated 3 weeks ago
- PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design (KDD 2025) ☆30 · Updated last year
- ☆49 · Updated 7 months ago
- [NAACL'25 🏆 SAC Award] Official code for "Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert… ☆13 · Updated 11 months ago
- ☆22 · Updated 10 months ago
- [COLM 2024] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models ☆25 · Updated last year
- ☆49 · Updated last year
- The official implementation of the paper "SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction" ☆52 · Updated last year
- The Official Implementation of Ada-KV [NeurIPS 2025] ☆123 · Updated last month
- PiKV: KV Cache Management System for Mixture of Experts [Efficient ML System] ☆48 · Updated 2 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆175 · Updated last year
- Code for paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆160 · Updated 2 months ago
- [NeurIPS ENLSP Workshop'24] CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios ☆16 · Updated last year
- Multi-Candidate Speculative Decoding ☆39 · Updated last year
- 🔥 LLM-powered GPU kernel synthesis: train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation… ☆110 · Updated 2 months ago
- This repository is the official implementation of "Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE" ☆36 · Updated 3 months ago
- Keyformer proposes KV cache reduction through key-token identification, without the need for fine-tuning ☆59 · Updated last year
- Pytorch implementation of our paper accepted by ICML 2024 -- CaM: Cache Merging for Memory-efficient LLMs Inference ☆47 · Updated last year
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection ☆151 · Updated 10 months ago
- QAQ: Quality Adaptive Quantization for LLM KV Cache ☆55 · Updated last year
- EE-LLM is a framework for large-scale training and inference of early-exit (EE) large language models (LLMs). ☆73 · Updated last year