JerryYin777 / Cross-Layer-Attention

Self-reproduction code for the paper "Reducing Transformer Key-Value Cache Size with Cross-Layer Attention" (MIT CSAIL)

☆18

Alternatives and similar repositories for Cross-Layer-Attention

Users interested in Cross-Layer-Attention also compare it with the repositories listed below.

JieShibo / MoLE
[ICML 2025 Oral] Mixture of Lookup Experts
☆53 Updated 5 months ago
Espere-1119-Song / VideoNSA
VideoNSA: Native Sparse Attention Scales Video Understanding
☆50 Updated last week
pprp / Pruner-Zero
[ICML24] Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs
☆94 Updated 10 months ago
fla-org / flash-bidirectional-linear-attention
Triton implementation of bi-directional (non-causal) linear attention
☆56 Updated 8 months ago
savadikarc / wegeft
WeGeFT: Weight‑Generative Fine‑Tuning for Multi‑Faceted Efficient Adaptation of Large Models
☆21 Updated 3 months ago
Aaronhuang-778 / Mixture-Compressor-MoE
[ICLR 2025] Mixture Compressor for Mixture-of-Experts LLMs Gains More
☆57 Updated 8 months ago
OpenSparseLLMs / Open-Pandora
Open-Pandora: On-the-fly Control Video Generation
☆34 Updated 10 months ago
shoaibahmed / llm_depth_pruning
Official implementation of the paper: "A deeper look at depth pruning of LLMs"
☆15 Updated last year
UNITES-Lab / C2R-MoE
[NAACL'25 🏆 SAC Award] Official code for "Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert…
☆10 Updated 8 months ago
htqin / IR-QLoRA
[ICML 2024 Oral] This project is the official implementation of our Accurate LoRA-Finetuning Quantization of LLMs via Information Retenti…
☆67 Updated last year
ZihaoHuang-notabot / Ultra-Sparse-Memory-Network
☆28 Updated last month
maomaocun / dLLM-cache
Official PyTorch implementation of the paper "dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching" (dLLM-Cache…
☆164 Updated last month
Doraemonzzz / xmixers
Xmixers: A collection of SOTA efficient token/channel mixers
☆29 Updated last month
hao-ai-lab / Awesome-Video-Attention
A curated list of recent papers on efficient video attention for video diffusion models, including sparsification, quantization, and cach…
☆41 Updated last month
ModelTC / QLLM
[ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod…
☆39 Updated last year
sramshetty / mixture-of-depths
An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
☆36 Updated last year
attention-survey / Efficient_Attention_Survey
A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention
☆189 Updated last month
megvii-research / IntLLaMA
IntLLaMA: A fast and light quantization solution for LLaMA
☆18 Updated 2 years ago
jxiw / M1
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
☆42 Updated 3 months ago
BBuf / flash-rwkv
☆32 Updated last year
thu-nics / R2R
[NeurIPS'25] The official code implementation for paper "R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Tok…
☆52 Updated this week
LiangrunFlora / Slow-Fast-Sampling
Official PyTorch implementation of the paper "Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Princ…
☆33 Updated 3 months ago
ThisisBillhe / ZipCache
[NeurIPS 2024] The official implementation of ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification
☆29 Updated 6 months ago
tilde-research / nsa-impl
An efficient implementation of the NSA (Native Sparse Attention) kernel
☆119 Updated 3 months ago
TianjinYellow / StableSPAM
☆25 Updated 6 months ago
zhixuan-lin / forgetting-transformer
[ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning
☆131 Updated 3 weeks ago
OpenSparseLLMs / Linearization
☆61 Updated 3 months ago
MarkXCloud / CSpD
The official repo of continuous speculative decoding
☆30 Updated 6 months ago
mit-han-lab / x-attention
[ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring
☆236 Updated 3 months ago
leo-yangli / VB-LoRA
This repo contains the source code for VB-LoRA: Extreme Parameter Efficient Fine-Tuning with Vector Banks (NeurIPS 2024).
☆42 Updated last year