DeepAuto-AI / hip-attentionLinks

Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton.

☆148

Alternatives and similar repositories for hip-attention

Users that are interested in hip-attention are comparing it to the libraries listed below

Sorting:

IST-DASLab / QuEST
Work in progress.
☆75Updated 4 months ago
snu-mllab / KVzip
[NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3)
☆146Updated 2 weeks ago
itsnamgyu / block-transformer
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
☆162Updated 7 months ago
SalesforceAIResearch / GemFilter
☆85Updated this week
DeepAuto-AI / sglang
This is a fork of SGLang for hip-attention integration. Please refer to hip-attention for detail.
☆18Updated last month
FasterDecoding / BitDelta
☆202Updated 11 months ago
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆102Updated last month
VITA-Group / Q-GaLore
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.
☆202Updated last year
HanGuo97 / lq-lora
☆127Updated last year
hetailang / SqueezeAttention
☆38Updated last year
Cornell-RelaxML / yaqa-quantization
☆62Updated 4 months ago
IST-DASLab / Quartet
☆106Updated 2 weeks ago
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆390Updated 4 months ago
chu-tianxiang / QuIP-for-all
QuIP quantization
☆60Updated last year
amazon-science / mxfp4-llm
Official implementation for Training LLMs with MXFP4
☆102Updated 6 months ago
Cornell-RelaxML / qtip
☆153Updated 4 months ago
tiiuae / onebitllms
Lightweight toolkit package to train and fine-tune 1.58bit Language models
☆98Updated 5 months ago
HazyResearch / lolcats
Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models"
☆249Updated 9 months ago
wdlctc / mini-s
☆52Updated last year
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆143Updated 9 months ago
Zyphra / tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
☆130Updated 11 months ago
hahnyuan / PB-LLM
PB-LLM: Partially Binarized Large Language Models
☆156Updated last year
samchaineau / llm_slerp_generation
Repo hosting codes and materials related to speeding LLMs' inference using token merging.
☆37Updated last month
jeffreysijuntan / lloco
The official repo for "LLoCo: Learning Long Contexts Offline"
☆118Updated last year
sebulo / LoQT
☆80Updated last year
BorealisAI / neuzip
Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks". This rep…
☆59Updated last year
minyoungg / LTE
☆69Updated last year
VITA-Group / WeLore
From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients. Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu,…
☆51Updated 2 weeks ago
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆131Updated 11 months ago
facebookresearch / LayerSkip
Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024
☆346Updated 6 months ago