mit-han-lab / x-attention
XAttention: Block Sparse Attention with Antidiagonal Scoring
☆137 · Updated 2 weeks ago
Alternatives and similar repositories for x-attention:
Users interested in x-attention are comparing it to the libraries listed below.
- A sparse attention kernel supporting mixed sparse patterns ☆186 · Updated 2 months ago
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ☆61 · Updated 5 months ago
- Code for paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆82 · Updated last week
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training ☆179 · Updated this week
- An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization ☆125 · Updated 2 months ago
- ☆156 · Updated 3 months ago
- Efficient Triton implementation of Native Sparse Attention ☆135 · Updated this week
- 16-fold memory access reduction with nearly no loss ☆88 · Updated 2 weeks ago
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection ☆98 · Updated last month
- The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression" ☆123 · Updated 4 months ago
- ☆75 · Updated 3 weeks ago
- 🔥 A minimal training framework for scaling FLA models ☆97 · Updated last week
- ☆122 · Updated 2 months ago
- Efficient 2:4 sparse training algorithms and implementations ☆54 · Updated 4 months ago
- Squeezed Attention: Accelerating Long Prompt LLM Inference ☆46 · Updated 4 months ago
- A WebUI for Side-by-Side Comparison of Media (Images/Videos) Across Multiple Folders ☆22 · Updated last month
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆269 · Updated 4 months ago
- SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models ☆19 · Updated 6 months ago
- [ICML 2024] CaM: Cache Merging for Memory-efficient LLM Inference (PyTorch implementation) ☆37 · Updated 9 months ago
- [ICML 2024 Oral] Official implementation of Accurate LoRA-Finetuning Quantization of LLMs via Information Retention ☆64 · Updated last year
- An auxiliary project analyzing the characteristics of KV in DiT Attention ☆29 · Updated 4 months ago
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models ☆28 · Updated 8 months ago
- VeOmni: Scaling any-modality model training to any accelerator with a PyTorch-native training framework ☆285 · Updated last week
- [NeurIPS 2024] Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching ☆100 · Updated 9 months ago
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆44 · Updated 8 months ago
- ☆67 · Updated last month
- ☆74 · Updated 3 weeks ago
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs ☆97 · Updated last week
- ☆39 · Updated 8 months ago
- [ICLR'25] ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation ☆74 · Updated 3 weeks ago