thu-ml / SageAttention
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
☆900 · Updated this week
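A minimal usage sketch, assuming the package installs as `sageattention` and exposes the `sageattn` kernel described in the repo's README (the `tensor_layout` flag and the `(batch, heads, seq_len, head_dim)` shapes below follow that README; exact defaults may differ across versions):

```python
# pip install sageattention
import torch
from sageattention import sageattn

# Illustrative half-precision tensors in "HND" layout:
# (batch, heads, seq_len, head_dim)
q = torch.randn(2, 16, 4096, 64, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Drop-in analogue of torch.nn.functional.scaled_dot_product_attention;
# quantization of Q/K happens inside the kernel, so no calibration pass is needed.
out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
```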
Alternatives and similar repositories for SageAttention:
Users interested in SageAttention are comparing it to the libraries listed below; a sketch of dropping SageAttention into such pipelines follows the list.
- xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism ☆1,195 · Updated last week
- [ICLR 2025] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models ☆612 · Updated last week
- [CVPR 2024 Highlight] DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models ☆649 · Updated last month
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆309 · Updated last month
- 📖 A curated list of Awesome Diffusion Inference Papers with codes, such as Sampling, Caching, Multi-GPUs, etc. 🎉🎉 ☆176 · Updated 2 weeks ago
- Context parallel attention that accelerates DiT model inference with dynamic caching ☆165 · Updated this week
- End-to-end recipes for optimizing diffusion models with torchao and diffusers (inference and FP8 training). ☆312 · Updated 3 weeks ago
- HART: Efficient Visual Generation with Hybrid Autoregressive Transformer ☆411 · Updated 3 months ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference ☆415 · Updated last month
- Efficient LLM Inference over Long Sequences ☆349 · Updated last month
- DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆420 · Updated last week
- FastVideo is a lightweight framework for accelerating large video diffusion models. ☆945 · Updated this week
- Infinity ∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis ☆902 · Updated last week
- [NeurIPS'24 Spotlight, ICLR'25] To speed up long-context LLM inference, compute attention with approximate, dynamic sparsity, which r… ☆891 · Updated last week
- [CVPR 2024] DeepCache: Accelerating Diffusion Models for Free ☆841 · Updated 7 months ago
- Ring attention implementation with flash attention ☆660 · Updated last month
- Official Implementation of "Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraini… ☆538 · Updated 5 months ago
- Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs. ☆1,215 · Updated 2 months ago
- FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups at small-to-medium batch sizes of 16-32 tokens. ☆690 · Updated 4 months ago
- Enhance-A-Video: Better Generated Video for Free ☆264 · Updated last month
- Next-Token Prediction is All You Need ☆1,977 · Updated 3 months ago
- Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24) ☆929 · Updated 3 weeks ago
- FlashInfer: Kernel Library for LLM Serving ☆1,887 · Updated this week
- [ICLR 2025] Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation. ☆1,142 · Updated last week
- [ICML 2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation ☆687 · Updated 3 months ago
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving ☆492 · Updated this week
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding ☆236 · Updated 4 months ago
- Code for the NeurIPS'24 paper QuaRot: end-to-end 4-bit inference for large language models. ☆322 · Updated 2 months ago
- VPTQ: a flexible, extreme low-bit quantization algorithm ☆572 · Updated last week
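Many of the repositories above (xDiT, DeepCache, FastVideo, the torchao/diffusers recipes) accelerate diffusion-transformer inference, where attention dominates runtime, and SageAttention is typically composed with them by swapping the attention call. The sketch below assumes the target model routes attention through `torch.nn.functional.scaled_dot_product_attention` with `(batch, heads, seq, head_dim)` tensors; the patching helper and its fallback conditions are illustrative, not an official integration path:

```python
import torch
import torch.nn.functional as F
from sageattention import sageattn

_stock_sdpa = F.scaled_dot_product_attention  # keep the original kernel around

def sdpa_with_sage(query, key, value, attn_mask=None, dropout_p=0.0,
                   is_causal=False, scale=None):
    # Route the common mask-free, half-precision case through SageAttention;
    # anything it may not cover falls back to the stock PyTorch kernel.
    if (attn_mask is None and dropout_p == 0.0 and scale is None
            and query.dtype in (torch.float16, torch.bfloat16)):
        return sageattn(query, key, value, tensor_layout="HND", is_causal=is_causal)
    return _stock_sdpa(query, key, value, attn_mask=attn_mask,
                       dropout_p=dropout_p, is_causal=is_causal, scale=scale)

# Patch once at startup; modules that call F.scaled_dot_product_attention
# pick up the quantized kernel transparently.
F.scaled_dot_product_attention = sdpa_with_sage
```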