microsoft/chunk-attention

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/microsoft/chunk-attention)

microsoft / chunk-attention

☆89

Alternatives and similar repositories for chunk-attention

Users that are interested in chunk-attention are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

microsoft / vattention
View on GitHub
Dynamic Memory Management for Serving LLMs without PagedAttention
☆504Updated this week
rayleizhu / vllm-ra
View on GitHub
[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
☆39Feb 29, 2024Updated 2 years ago
October2001 / Awesome-KV-Cache-Compression
View on GitHub
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
☆726Apr 15, 2026Updated 3 months ago
Infini-AI-Lab / TriForce
View on GitHub
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
☆281Aug 31, 2024Updated last year
LoongServe / LoongServe
View on GitHub
☆135Nov 11, 2024Updated last year
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
microsoft / ParrotServe
View on GitHub
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
☆222Sep 21, 2024Updated last year
microsoft / sarathi-serve
View on GitHub
A low-latency & high-throughput serving engine for LLMs
☆512Jan 8, 2026Updated 6 months ago
flashinfer-ai / cutlass-viz
View on GitHub
☆65Apr 26, 2025Updated last year
Adaxry / Unified_Layer_Skipping
View on GitHub
☆15Apr 11, 2024Updated 2 years ago
LLMServe / DistServe
View on GitHub
Disaggregated serving system for Large Language Models (LLMs).
☆826Apr 6, 2025Updated last year
eddiegaoo / Apt-Serve
View on GitHub
☆21Jun 9, 2025Updated last year
google / rago
View on GitHub
☆31Jun 22, 2025Updated last year
mit-han-lab / duo-attention
View on GitHub
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
☆539Feb 10, 2025Updated last year
zejia-lin / BulletServe
View on GitHub
Boosting GPU utilization for LLM serving via dynamic spatial-temporal prefill & decode orchestration
☆53Jan 8, 2026Updated 6 months ago
Serverless GPU API endpoints on Runpod - Get Bonus Credits • Ad
Skip the infrastructure headaches. Auto-scaling, pay-as-you-go, no-ops approach lets you focus on innovating your application.
llumnix-project / llumnix-ray
View on GitHub
Efficient and easy multi-instance LLM serving
☆563Mar 12, 2026Updated 4 months ago
ByteDance-Seed / FlexPrefill
View on GitHub
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆170Oct 13, 2025Updated 9 months ago
UChi-JCL / CacheGen
View on GitHub
☆169Oct 9, 2024Updated last year
ByteDance-Seed / ShadowKV
View on GitHub
[ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
☆310May 1, 2025Updated last year
andy-yang-1 / DoubleSparse
View on GitHub
16-fold memory access reduction with nearly no loss
☆107Mar 26, 2025Updated last year
AniZpZ / AutoSmoothQuant
View on GitHub
An easy-to-use package for implementing SmoothQuant for LLMs
☆111Apr 7, 2025Updated last year
wu-kan / wuk_cupti_wrapper
View on GitHub
a simple API to use CUPTI
☆10Aug 19, 2025Updated 11 months ago
efeslab / Atom
View on GitHub
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆344Jul 2, 2024Updated 2 years ago
NVIDIA / Star-Attention
View on GitHub
Efficient LLM Inference over Long Sequences
☆392Jun 25, 2025Updated last year
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
NonvolatileMemory / flash_tree_attn
View on GitHub
☆20Dec 24, 2024Updated last year
Infini-AI-Lab / MagicDec
View on GitHub
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆154Dec 4, 2024Updated last year
sail-sg / SimLayerKV
View on GitHub
The official implementation of paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction.
☆54Oct 18, 2024Updated last year
FasterDecoding / TEAL
View on GitHub
☆167Feb 15, 2025Updated last year
jy-yuan / KIVI
View on GitHub
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
☆419Nov 20, 2025Updated 8 months ago
feifeibear / long-context-attention
View on GitHub
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference
☆681May 21, 2026Updated 2 months ago
EfficientLLMSys / MuxServe
View on GitHub
☆15Jun 26, 2024Updated 2 years ago
efeslab / Nanoflow
View on GitHub
A throughput-oriented high-performance serving framework for LLMs
☆968Mar 29, 2026Updated 3 months ago
Zanette-Labs / SpeculativeRejection
View on GitHub
[NeurIPS 2024] Fast Best-of-N Decoding via Speculative Rejection
☆56Oct 29, 2024Updated last year
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
machilusZ / FastGen
View on GitHub
This repo contains the source code for: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
☆44Aug 14, 2024Updated last year
HPMLL / ZipServ_ASPLOS26
View on GitHub
☆50Dec 19, 2025Updated 7 months ago
Equationliu / Kangaroo
View on GitHub
[NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin…
☆72Jun 26, 2024Updated 2 years ago
JingyangXiang / DFRot
View on GitHub
[COLM 2025] DFRot: Achieving Outlier-Free and Massive Activation-Free for Rotated LLMs with Refined Rotation; 知乎：https://zhuanlan.zhihu.c…
☆30Mar 5, 2025Updated last year
feifeibear / ChituAttention
View on GitHub
Quantized Attention on GPU
☆45Nov 22, 2024Updated last year
mit-han-lab / omniserve
View on GitHub
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆850Mar 6, 2025Updated last year
PanZaifeng / FastTree-Artifact
View on GitHub
☆32Mar 24, 2025Updated last year