ydyhello / TailorKV
Official implementation of "TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization" (Findings of ACL 2025).
☆20 · Updated 5 months ago
Alternatives and similar repositories for TailorKV
Users interested in TailorKV are comparing it to the libraries listed below.
- ☆10 · Updated last year
- Official PyTorch implementation of the paper "Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Princ… ☆37 · Updated 5 months ago
- [NeurIPS 2024] The official implementation of ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification ☆32 · Updated 9 months ago
- [NeurIPS'25] The official code implementation for the paper "R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Tok… ☆72 · Updated this week
- Kinetics: Rethinking Test-Time Scaling Laws ☆85 · Updated 6 months ago
- (ACL 2025 oral) SCOPE: Optimizing KV Cache Compression in Long-context Generation ☆33 · Updated 7 months ago
- ☆53 · Updated last year
- [ICLR 2025] SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration ☆61 · Updated 10 months ago
- ☆62 · Updated 6 months ago
- [ICLR 2025] Mixture Compressor for Mixture-of-Experts LLMs Gains More ☆65 · Updated 10 months ago
- This repo contains the source code for: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs ☆43 · Updated last year
- PyTorch implementation of our paper accepted by ICML 2024 -- CaM: Cache Merging for Memory-efficient LLMs Inference ☆47 · Updated last year
- [NeurIPS 2025] Official implementation of "Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning" ☆26 · Updated 2 months ago
- [ICLR'24 Spotlight] Code for the paper "Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy" ☆102 · Updated 6 months ago
- dParallel: Learnable Parallel Decoding for dLLMs ☆53 · Updated 2 months ago
- ☆72 · Updated 6 months ago
- Official PyTorch implementation of our paper accepted at ICLR 2024 -- Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLM… ☆50 · Updated last year
- PoC for "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning" [NeurIPS '25] ☆61 · Updated 3 months ago
- Official implementation of FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration ☆29 · Updated last month
- [ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference ☆55 · Updated last year
- Official implementation of the paper "Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models" ☆56 · Updated 2 weeks ago
- The official implementation of the paper "Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques" (TMLR) ☆88 · Updated 9 months ago
- [CoLM'25] The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression" ☆153 · Updated last month
- Efficient LLM query routing via multi-sampling. BEST-Route selects both the model and the number of responses based on query difficulty, cutting … ☆38 · Updated 5 months ago
- [NeurIPS'25] dKV-Cache: The Cache for Diffusion Language Models ☆128 · Updated 7 months ago
- ☆109 · Updated 3 months ago
- [ICML'24] Pruner-Zero: Evolving Symbolic Pruning Metric from Scratch for LLMs ☆98 · Updated last year
- [ICML'25] Our study systematically investigates massive values in LLMs' attention mechanisms. First, we observe massive values are concen… ☆85 · Updated 6 months ago
- A lightweight inference engine built for block diffusion models ☆39 · Updated last month
- Source code for the paper "LongGenBench: Long-context Generation Benchmark" ☆24 · Updated last year