DeepAuto-AI / hip-attention
Training-free, post-training, efficient sub-quadratic-complexity attention, implemented with OpenAI Triton.
☆136 · Updated last week
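For context, here is a minimal PyTorch sketch of the general idea behind sub-quadratic sparse attention: each query attends only to a small set of top-scoring keys. This is an illustrative assumption, not hip-attention's actual hierarchical pruning algorithm or its Triton kernels; the function name and the `k_keep` parameter are hypothetical.

```python
# Illustrative sketch only -- NOT hip-attention's algorithm or API.
# Generic top-k sparse attention: each query keeps its k_keep highest-scoring
# keys and masks out the rest. hip-attention instead selects candidates
# hierarchically inside fused Triton kernels, so the full score matrix is
# never materialized; this dense sketch only shows the masking idea.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, k_keep=64):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (B, H, Tq, Tk)
    # Indices of the k_keep largest scores in each query row.
    idx = scores.topk(min(k_keep, scores.shape[-1]), dim=-1).indices
    # Additive mask: 0 at kept positions, -inf everywhere else.
    mask = torch.full_like(scores, float("-inf")).scatter(-1, idx, 0.0)
    return F.softmax(scores + mask, dim=-1) @ v  # (B, H, Tq, head_dim)
```

With `k_keep` fixed, each query aggregates only a constant number of values; the overall computation becomes sub-quadratic once the candidate selection itself avoids the dense score matrix, which is the part the fused kernels handle.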
Alternatives and similar repositories for hip-attention
Users interested in hip-attention are comparing it to the repositories listed below.
- Work in progress. ☆69 · Updated 2 weeks ago
- ☆126 · Updated last year
- This is a fork of SGLang for hip-attention integration. Please refer to hip-attention for details. ☆14 · Updated this week
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆116 · Updated 6 months ago
- ☆37 · Updated 8 months ago
- ☆80 · Updated 5 months ago
- The official implementation of the paper "Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques" (TMLR). ☆71 · Updated 3 months ago
- PB-LLM: Partially Binarized Large Language Models ☆152 · Updated last year
- Official implementation of SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks ☆37 · Updated 4 months ago
- ☆130 · Updated 4 months ago
- ☆198 · Updated 6 months ago
- ☆137 · Updated this week
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆198 · Updated 11 months ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆158 · Updated 2 months ago
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" ☆98 · Updated 8 months ago
- KV cache compression for high-throughput LLM inference ☆131 · Updated 4 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM (see the quantization sketch after this list) ☆163 · Updated 11 months ago
- ☆68 · Updated this week
- Code for the paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆113 · Updated last month
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs ☆108 · Updated 2 months ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆239 · Updated 4 months ago
- Efficient LLM Inference over Long Sequences ☆378 · Updated 3 weeks ago
- The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression" ☆134 · Updated 3 weeks ago
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆117 · Updated last year
- ☆51 · Updated 7 months ago
- QuIP quantization ☆54 · Updated last year
- ☆38 · Updated this week
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆213 · Updated 7 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆273 · Updated last month
- A repository for research on medium-sized language models. ☆76 · Updated last year
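The two KV-cache items in the list above (high-throughput KV cache compression and GEAR) share a common building block: storing cached keys and values in low-bit form. Below is a minimal, hypothetical sketch of per-token INT8 KV quantization; this is only the baseline idea, and GEAR's actual recipe additionally corrects the quantization residual with low-rank and outlier terms. The function names are assumptions, not any of these libraries' APIs.

```python
# Illustrative sketch only -- baseline per-token INT8 KV-cache quantization,
# not GEAR's full recipe (which adds low-rank + outlier error correction).
import torch

def quantize_kv(x):
    # x: (batch, heads, seq_len, head_dim) float16/float32 keys or values.
    # One scale per token vector, chosen so the max magnitude maps to 127.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(x / scale).to(torch.int8)  # stored at 8 bits per element
    return q, scale

def dequantize_kv(q, scale):
    # Approximate reconstruction used at attention time.
    return q.to(scale.dtype) * scale
```

The appeal is purely capacity arithmetic: halving (or quartering) the bytes per cached element lets the same GPU memory hold proportionally longer sequences or larger batches, at the cost of a small reconstruction error that the fancier recipes try to drive toward zero.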