dongwonjo / FastKV

Official Implementation of FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration

☆29

Alternatives and similar repositories for FastKV

Users interested in FastKV are comparing it to the libraries listed below

BaohaoLiao / ApiQ
[EMNLP 2024] Quantize LLM to extremely low-bit, and finetune the quantized LLMs
☆15 Updated last year
AkideLiu / MiniCache
☆10 Updated last year
ArminAzizi98 / LaMDA
☆15 Updated last year
Lucky-Lance / SPP
[ICML 2024] SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models
☆21 Updated last year
OpenSparseLLMs / Linearization
☆62 Updated 5 months ago
Linking-ai / SCOPE
(ACL 2025 oral) SCOPE: Optimizing KV Cache Compression in Long-context Generation
☆33 Updated 7 months ago
pprp / Pruner-Zero
[ICML24] Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs
☆98 Updated last year
pixeli99 / MixLN
[ICLR 2025] Official Pytorch Implementation of "Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN" by Pengxia…
☆27 Updated 5 months ago
GATECH-EIC / Linearized-LLM
[ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models
☆35 Updated last year
LCM-Lab / LOGO
Code for paper: Long cOntext aliGnment via efficient preference Optimization
☆23 Updated 2 months ago
abdelfattah-lab / TokenButler
☆26 Updated last month
jiwonsong-dev / ReasoningPathCompression
[NeurIPS 2025] Official implementation of "Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning"
☆26 Updated 2 months ago
MIRALab-USTC / LLM-AttentionPredictor
The code for "AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference", Qingyue Yang, Jie Wang, Xing Li, Zhihai Wang, Ch…
☆25 Updated 5 months ago
Infini-AI-Lab / Kinetics
Kinetics: Rethinking Test-Time Scaling Laws
☆84 Updated 5 months ago
SqueezeAILab / SqueezedAttention
[ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference
☆55 Updated last year
thu-ml / TetraJet-MXFP4Training
Pytorch implementation of "Oscillation-Reduced MXFP4 Training for Vision Transformers" on DeiT Model Pre-training
☆34 Updated 6 months ago
machilusZ / FastGen
This repo contains the source code for: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
☆43 Updated last year
Infini-AI-Lab / S2FT
☆19 Updated 11 months ago
hahnyuan / ASVD4LLM
Activation-aware Singular Value Decomposition for Compressing Large Language Models
☆82 Updated last year
Jikai0Wang / OPT-Tree
☆29 Updated 7 months ago
Intelligent-Computing-Lab-Panda / TesseraQ
☆24 Updated last year
ylsung / rsq
Code for "RSQ: Learning from Important Tokens Leads to Better Quantized LLMs"
☆20 Updated 6 months ago
sail-sg / SimLayerKV
The official implementation of paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction.
☆52 Updated last year
sramshetty / mixture-of-depths
An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
☆36 Updated last year
SalesforceAIResearch / GemFilter
☆85 Updated last month
jiwonsong-dev / SLEB
[ICML 2024] Official Implementation of SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks
☆38 Updated 10 months ago
deep-spin / adasplash
AdaSplash: Adaptive Sparse Flash Attention (aka Flash Entmax Attention)
☆32 Updated 3 months ago
UtkarshSaxena1 / EigenAttn
☆19 Updated last year
UNITES-Lab / MC-SMoE
[ICLR'24 Spotlight] Code for the paper "Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy"
☆102 Updated 6 months ago
JarvisPei / CMoE
Implementation for the paper: CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference
☆31 Updated 9 months ago