DeepAuto-AI / hip-attention
Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton.
☆141 · Updated this week
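hip-attention replaces dense O(T²) attention with a sparse pattern chosen at inference time, with no retraining. As a rough illustration of the general idea only (not hip-attention's actual hierarchical pruning algorithm or its Triton kernel; the function name, the mean-pooled block scoring, and all parameters below are invented for this example), here is a block-level top-k attention sketch in PyTorch:

```python
# Illustrative sketch of sub-quadratic attention via block-level top-k
# key selection. NOT hip-attention's algorithm; it only shows the shape
# of the idea: each query block attends to a small, fixed number of
# key blocks, so the attention itself scales linearly for fixed topk.
import torch
import torch.nn.functional as F

def blockwise_topk_attention(q, k, v, block=64, topk=4):
    # q, k, v: (T, d) single-head tensors; T assumed divisible by `block`.
    T, d = q.shape
    nb = T // block
    qb = q.view(nb, block, d)
    kb = k.view(nb, block, d)
    vb = v.view(nb, block, d)
    # Coarse scores between mean-pooled block summaries.
    coarse = qb.mean(dim=1) @ kb.mean(dim=1).T          # (nb, nb)
    sel = coarse.topk(min(topk, nb), dim=-1).indices    # (nb, topk)
    out = torch.empty_like(qb)
    for i in range(nb):
        keys = kb[sel[i]].reshape(-1, d)                # (topk*block, d)
        vals = vb[sel[i]].reshape(-1, d)
        # No causal mask: this sketch is bidirectional for brevity.
        attn = F.softmax(qb[i] @ keys.T / d**0.5, dim=-1)
        out[i] = attn @ vals
    return out.view(T, d)

# Example: 1024 tokens, 64-dim head, 4 key blocks per query block.
q, k, v = (torch.randn(1024, 64) for _ in range(3))
print(blockwise_topk_attention(q, k, v).shape)  # torch.Size([1024, 64])
```

Note that the coarse block-score matrix above is still quadratic in the number of blocks; hip-attention's hierarchical selection avoids that step, which is what makes it genuinely sub-quadratic.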
Alternatives and similar repositories for hip-attention
Users interested in hip-attention are comparing it to the libraries listed below.
- Work in progress. ☆70 · Updated last month
- ☆199 · Updated 8 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆198 · Updated last year
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆160 · Updated 3 months ago
- This is a fork of SGLang for hip-attention integration. Please refer to hip-attention for details. ☆15 · Updated this week
- ☆38 · Updated 9 months ago
- ☆145 · Updated last month
- ☆83 · Updated 6 months ago
- Repo hosting code and materials related to speeding up LLM inference using token merging. ☆36 · Updated 2 weeks ago
- Lightweight toolkit to train and fine-tune 1.58-bit language models. ☆82 · Updated 2 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU Clusters ☆127 · Updated 8 months ago
- ☆127 · Updated last year
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆244 · Updated 6 months ago
- ☆51 · Updated 9 months ago
- PB-LLM: Partially Binarized Large Language Models ☆153 · Updated last year
- EvaByte: Efficient Byte-level Language Models at Scale ☆103 · Updated 3 months ago
- ☆68 · Updated last year
- PyTorch implementation of models from the Zamba2 series. ☆184 · Updated 6 months ago
- RWKV-7: Surpassing GPT ☆94 · Updated 8 months ago
- ☆75 · Updated last month
- Efficient LLM Inference over Long Sequences ☆385 · Updated last month
- KV cache compression for high-throughput LLM inference ☆134 · Updated 6 months ago
- Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks". This rep… ☆59 · Updated 9 months ago
- QuIP quantization ☆54 · Updated last year
- ☆137 · Updated 5 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆86 · Updated last month
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … ☆61 · Updated 9 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆288 · Updated 2 months ago
- Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3); a generic eviction sketch follows this list. ☆95 · Updated last week
- Official PyTorch implementation of "GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance" (ICML 2025) ☆39 · Updated last month
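Several entries above concern KV cache compression and eviction. As a generic stand-in only, the sketch below scores cached tokens by the attention mass they have accumulated (an H2O-style heuristic; this is not the query-agnostic method of the repository listed above, and every name here is invented for the example):

```python
# Generic KV cache eviction sketch: keep the `budget` entries with the
# highest accumulated attention mass, drop the rest. Real policies differ
# mainly in how the per-token score is computed; this only shows the
# bookkeeping.
import torch

def evict_kv(keys, values, acc_attn, budget):
    # keys, values: (T, d) cached tensors; acc_attn: (T,) attention mass
    # each cached token has received so far; budget: entries to keep.
    if keys.shape[0] <= budget:
        return keys, values, acc_attn
    keep = acc_attn.topk(budget).indices.sort().values  # preserve token order
    return keys[keep], values[keep], acc_attn[keep]

# Example: keep the 512 most-attended entries out of 2048 cached tokens.
T, d = 2048, 64
k, v = torch.randn(T, d), torch.randn(T, d)
scores = torch.rand(T)
k2, v2, s2 = evict_kv(k, v, scores, budget=512)
print(k2.shape)  # torch.Size([512, 64])
```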