machilusZ/FastGen

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/machilusZ/FastGen)

machilusZ / FastGen

This repo contains the source code for: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

☆44

Alternatives and similar repositories for FastGen

Users that are interested in FastGen are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

VITA-Group / Q-Hitter
View on GitHub
☆15Jun 4, 2024Updated 2 years ago
zyxxmu / cam
View on GitHub
Pytorch implementation of our paper accepted by ICML 2024 -- CaM: Cache Merging for Memory-efficient LLMs Inference
☆50Jun 19, 2024Updated 2 years ago
AnswerDotAI / cold-compress
View on GitHub
Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of…
☆153Aug 9, 2024Updated last year
d-matrix-ai / keyformer-llm
View on GitHub
Keyformer proposes KV Cache reduction through key tokens identification and without the need for fine-tuning
☆57Mar 26, 2024Updated 2 years ago
mit-han-lab / Quest
View on GitHub
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆400Jul 10, 2025Updated last year
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
shoaibahmed / llm_depth_pruning
View on GitHub
Official implementation of the paper: "A deeper look at depth pruning of LLMs"
☆15Jul 24, 2024Updated last year
FasterDecoding / SnapKV
View on GitHub
☆324Jul 10, 2025Updated last year
SqueezeAILab / KVQuant
View on GitHub
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆431Aug 13, 2024Updated last year
PositionalHidden / PositionalHidden
View on GitHub
To mitigate position bias in LLMs, especially in long-context scenarios, we scale only one dimension of LLMs, reducing position bias and …
☆12Jun 18, 2024Updated 2 years ago
FMInference / H2O
View on GitHub
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
☆528Aug 1, 2024Updated last year
hdong920 / LESS
View on GitHub
☆53May 13, 2024Updated 2 years ago
Zoeyyao27 / SirLLM
View on GitHub
This repository contains the code for the paper: SirLLM: Streaming Infinite Retentive LLM
☆60May 28, 2024Updated 2 years ago
olemon111 / WIPE
View on GitHub
WIPE implementation
☆13Nov 26, 2023Updated 2 years ago
Adaxry / Unified_Layer_Skipping
View on GitHub
☆15Apr 11, 2024Updated 2 years ago
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
YJHMITWEB / ExFlow
View on GitHub
Explore Inter-layer Expert Affinity in MoE Model Inference
☆16May 6, 2024Updated 2 years ago
AboveParadise / LLMCBench
View on GitHub
☆28Dec 2, 2024Updated last year
hetailang / SqueezeAttention
View on GitHub
☆37Oct 10, 2024Updated last year
drarijitdas / Natural-GaLore
View on GitHub
An extention to the GaLore paper, to perform Natural Gradient Descent in low rank subspace
☆19Oct 21, 2024Updated last year
jy-yuan / KIVI
View on GitHub
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
☆418Nov 20, 2025Updated 8 months ago
p1nksnow / MoICE
View on GitHub
Official implementation for "Mixture of In-Context Experts Enhance LLMs’ Awareness of Long Contexts" (Accepted by Neurips2024)
☆14Jan 7, 2025Updated last year
facebookresearch / Ternary_Binary_Transformer
View on GitHub
ACL 2023
☆39Jun 6, 2023Updated 3 years ago
kssteven418 / SqueezeLLM-gradients
View on GitHub
☆21Feb 5, 2024Updated 2 years ago
SusCom-Lab / ZSMerge
View on GitHub
☆23Sep 24, 2025Updated 9 months ago
Simple, predictable pricing with DigitalOcean hosting • Ad
Always know what you'll pay with monthly caps and flat pricing. Enterprise-grade infrastructure trusted by 600k+ customers.
Qualcomm-AI-research / lr-qat
View on GitHub
☆54Nov 5, 2024Updated last year
UNITES-Lab / C2R-MoE
View on GitHub
[NAACL'25 🏆 SAC Award] Official code for "Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert…
☆16Feb 4, 2025Updated last year
Infini-AI-Lab / MagicDec
View on GitHub
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆154Dec 4, 2024Updated last year
Trustworthy-ML-Lab / ThinkEdit
View on GitHub
[EMNLP 25] An effective and interpretable weight-editing method for mitigating overly short reasoning in LLMs, and a mechanistic study un…
☆19Dec 17, 2025Updated 7 months ago
ClubieDong / QAQ-KVCacheQuantization
View on GitHub
QAQ: Quality Adaptive Quantization for LLM KV Cache
☆54Mar 27, 2024Updated 2 years ago
ChenMnZ / PrefixQuant
View on GitHub
An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization
☆176Nov 26, 2025Updated 7 months ago
Nota-NetsPresso / shortened-llm
View on GitHub
Compressed LLMs for Efficient Text Generation [ICLR'24 Workshop]
☆90Sep 13, 2024Updated last year
microsoft / MInference
View on GitHub
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention…
☆1,221Apr 8, 2026Updated 3 months ago
FFY0 / AdaKV
View on GitHub
The Official Implementation of Ada-KV [NeurIPS 2025]
☆139Nov 26, 2025Updated 7 months ago
GPU virtual machines on DigitalOcean Gradient AI • Ad
Get to production fast with high-performance AMD and NVIDIA GPUs you can spin up in seconds. The definition of operational simplicity.
OpenBitSys / BitDistiller
View on GitHub
[ACL 2024] A novel QAT with Self-Distillation framework to enhance ultra low-bit LLMs.
☆139May 16, 2024Updated 2 years ago
XIANGLONGYAN / PBS2P
View on GitHub
PyTorch code for our paper "Progressive Binarization with Semi-Structured Pruning for LLMs"
☆13Jul 11, 2026Updated last week
IST-DASLab / gemm-fp8
View on GitHub
High Performance FP8 GEMM Kernels for SM89 and later GPUs.
☆21Jan 24, 2025Updated last year
dongwonjo / FastKV
View on GitHub
[ACL Findings 2026] Official Implementation of "FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acc…
☆32Apr 14, 2026Updated 3 months ago
mutonix / pyramidinfer
View on GitHub
☆47Nov 25, 2024Updated last year
xvyaward / owq
View on GitHub
Code for the AAAI 2024 Oral paper "OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Model…
☆72Mar 7, 2024Updated 2 years ago
Infini-AI-Lab / Kinetics
View on GitHub
Kinetics: Rethinking Test-Time Scaling Laws
☆87Jul 11, 2025Updated last year