66RING / LongShortTokenDecodingLinks

Long short token decoding speed up 4x for long context LLM. A hundred lines of core code. Open source for learning.

☆8

Alternatives and similar repositories for LongShortTokenDecoding

Users that are interested in LongShortTokenDecoding are comparing it to the libraries listed below

Sorting:

DerrickYLJ / TidalDecode
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆44Updated this week
zyxxmu / cam
Pytorch implementation of our paper accepted by ICML 2024 -- CaM: Cache Merging for Memory-efficient LLMs Inference
☆42Updated last year
FFY0 / AdaKV
The Official Implementation of Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
☆87Updated last month
NonvolatileMemory / flash_tree_attn
☆19Updated 7 months ago
hyx1999 / SAM-Decoding
Official Implementation of SAM-Decoding: Speculative Decoding via Suffix Automaton
☆30Updated 5 months ago
SqueezeAILab / SqueezedAttention
SQUEEZED ATTENTION: Accelerating Long Prompt LLM Inference
☆50Updated 8 months ago
NJUNLP / MCSD
Multi-Candidate Speculative Decoding
☆36Updated last year
machilusZ / FastGen
This repo contains the source code for: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
☆38Updated 11 months ago
LiuXiaoxuanPKU / OSD
☆54Updated 8 months ago
microsoft / chunk-attention
☆78Updated 3 months ago
IST-DASLab / HALO
HALO: Hadamard-Assisted Low-Precision Optimization and Training method for finetuning LLMs. 🚀 The official implementation of https://arx…
☆18Updated 5 months ago
Equationliu / Kangaroo
[NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin…
☆59Updated last year
smart-lty / ParallelSpeculativeDecoding
[ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length
☆102Updated 3 months ago
cat538 / SKVQ
[COLM 2024] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models
☆23Updated 10 months ago
ClubieDong / QAQ-KVCacheQuantization
QAQ: Quality Adaptive Quantization for LLM KV Cache
☆52Updated last year
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆103Updated 4 months ago
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆124Updated 2 months ago
d-matrix-ai / keyformer-llm
☆54Updated last year
feifeibear / ChituAttention
Quantized Attention on GPU
☆44Updated 8 months ago
li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆42Updated last month
shadowpa0327 / Palu
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
☆128Updated 5 months ago
Dominic789654 / LongGenBench
Source code for the paper "LongGenBench: Long-context Generation Benchmark"
☆22Updated 10 months ago
tile-ai / AttentionEngine
☆50Updated 2 months ago
Jingyu6 / speculative_prefill
☆37Updated 2 months ago
hao-ai-lab / vllm-ltr
[NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank
☆51Updated 9 months ago
BaiTheBest / SparseLLM
Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024)
☆64Updated 4 months ago
aiha-lab / MX-QLLM
LLM Inference with Microscaling Format
☆25Updated 8 months ago
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆165Updated last year
LINs-lab / DeFT
[ICLR 2025] DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference
☆31Updated last month
tsinghua-ideal / Twilight
Adaptive Attention Sparsity with Hierarchical Top-p Pruning
☆19Updated 5 months ago