ByteDance-Seed/FlexPrefill

Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

☆170

Alternatives and similar repositories for FlexPrefill

Users who are interested in FlexPrefill are comparing it to the libraries listed below.

mit-han-lab / x-attention
[ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring
☆280 · Jul 6, 2025 · Updated last year
microsoft / SeerAttention
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
☆213 · Jul 10, 2026 · Updated last week
XunhaoLai / native-sparse-attention-triton
Efficient triton implementation of Native Sparse Attention.
☆284 · May 23, 2025 · Updated last year
mit-han-lab / Block-Sparse-Attention
A sparse attention kernel supporting mix sparse patterns
☆535 · Jan 18, 2026 · Updated 6 months ago
mit-han-lab / Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆400 · Jul 10, 2025 · Updated last year
microsoft / MInference
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention…
☆1,221 · Apr 8, 2026 · Updated 3 months ago
ByteDance-Seed / ShadowKV
[ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
☆310 · May 1, 2025 · Updated last year
thu-nics / MoA
[CoLM'25] The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression>
☆159 · Jan 14, 2026 · Updated 6 months ago
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆127 · Jan 27, 2026 · Updated 5 months ago
xinghaow99 / pbs-attn
[ICML 2026] Sparser Block-Sparse Attention via Token Permutation
☆31 · May 22, 2026 · Updated last month
Infini-AI-Lab / MagicPIG
[ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation
☆255 · Dec 16, 2024 · Updated last year
thu-ml / SpargeAttn
[ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference.
☆1,017 · Feb 25, 2026 · Updated 4 months ago
FasterDecoding / SnapKV
☆324 · Jul 10, 2025 · Updated last year
mit-han-lab / duo-attention
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
☆539 · Feb 10, 2025 · Updated last year
OpenBMB / infllmv2_cuda_impl
☆102 · Feb 11, 2026 · Updated 5 months ago
chenyu-jiang / dcp
Code repository for the SOSP'25 paper DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism.
☆21 · Nov 28, 2025 · Updated 7 months ago
sgl-project / sgl-flash-attn
Fast and memory-efficient exact attention
☆22 · Jun 26, 2026 · Updated 3 weeks ago
attention-survey / Efficient_Attention_Survey
A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention
☆304 · Dec 1, 2025 · Updated 7 months ago
qhfan / FlashPrefill
Implementation of "FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling"
☆53 · Apr 27, 2026 · Updated 2 months ago
October2001 / Awesome-KV-Cache-Compression
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
☆725 · Apr 15, 2026 · Updated 3 months ago
FFY0 / AdaKV
The Official Implementation of Ada-KV [NeurIPS 2025]
☆139 · Nov 26, 2025 · Updated 7 months ago
flashinfer-ai / cutlass-viz
☆65 · Apr 26, 2025 · Updated last year
mit-han-lab / flash-moba
☆250 · Nov 19, 2025 · Updated 8 months ago
snu-mllab / KVzip
[NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3)
☆224 · Feb 11, 2026 · Updated 5 months ago
FMInference / H2O
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
☆528 · Aug 1, 2024 · Updated last year
fla-org / native-sparse-attention
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
☆1,012 · Feb 5, 2026 · Updated 5 months ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆154 · Dec 4, 2024 · Updated last year
jy-yuan / KIVI
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
☆418 · Nov 20, 2025 · Updated 8 months ago
sail-sg / SimLayerKV
The official implementation of paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction.
☆54 · Oct 18, 2024 · Updated last year
Dao-AILab / sonic-moe
Accelerating MoE with IO and Tile-aware Optimizations
☆732 · Jul 4, 2026 · Updated 2 weeks ago
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆850 · Mar 6, 2025 · Updated last year
tilde-research / nsa-release
An efficient implementation of the NSA (Native Sparse Attention) kernel
☆133 · Jun 24, 2025 · Updated last year
thunlp / InfLLM
The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Mem…
☆405 · Apr 20, 2024 · Updated 2 years ago
SUSTechBruce / LOOK-M
[EMNLP 2024 Findings🔥] Official implementation of "LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context In…
☆103 · Nov 9, 2024 · Updated last year
NathanGodey / qfilters
Repository for the Q-Filters method (https://arxiv.org/pdf/2503.02812)
☆34 · Mar 7, 2025 · Updated last year
feifeibear / ChituAttention
Quantized Attention on GPU
☆45 · Nov 22, 2024 · Updated last year
antgroup / cakekv
☆39 · Mar 17, 2025 · Updated last year
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆158 · Feb 5, 2025 · Updated last year
DeepAuto-AI / sglang
This is a fork of SGLang for hip-attention integration. Please refer to hip-attention for detail.
☆18 · Mar 31, 2026 · Updated 3 months ago