DeepAuto-AI / hip-attention
Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton.
☆148 · Updated last week
Alternatives and similar repositories for hip-attention
Users interested in hip-attention are comparing it to the libraries listed below.
- Work in progress. ☆75 · Updated 4 months ago
- [NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3) ☆146 · Updated 2 weeks ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆162 · Updated 7 months ago
- ☆85 · Updated this week
- A fork of SGLang for hip-attention integration; please refer to hip-attention for details. ☆18 · Updated last month
- ☆202 · Updated 11 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆102 · Updated last month
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆202 · Updated last year
- ☆127 · Updated last year
- ☆38 · Updated last year
- ☆62 · Updated 4 months ago
- ☆106 · Updated 2 weeks ago
- Efficient LLM Inference over Long Sequences ☆390 · Updated 4 months ago
- QuIP quantization ☆60 · Updated last year
- Official implementation for Training LLMs with MXFP4 ☆102 · Updated 6 months ago
- ☆153 · Updated 4 months ago
- Lightweight toolkit package to train and fine-tune 1.58-bit language models