xlite-dev / Awesome-Diffusion-Inference

📖A curated list of Awesome Diffusion Inference Papers with codes: Sampling, Caching, Multi-GPUs, etc. 🎉🎉

☆210

Alternatives and similar repositories for Awesome-Diffusion-Inference:

Users that are interested in Awesome-Diffusion-Inference are comparing it to the libraries listed below

mit-han-lab / deepcompressor
Model Compression Toolbox for Large Language Models and Diffusion Models
☆435Updated 3 weeks ago
thu-nics / DiTFastAttn
☆160Updated 3 months ago
thu-ml / SpargeAttn
SpargeAttention: A training-free sparse attention that can accelerate any model inference.
☆488Updated this week
SandAI-org / MagiAttention
A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training
☆248Updated this week
mit-han-lab / Block-Sparse-Attention
A sparse attention kernel supporting mix sparse patterns
☆197Updated 2 months ago
xdit-project / DistVAE
A parallelism VAE avoids OOM for high resolution image generation
☆61Updated 3 months ago
chengzeyi / ParaAttention
https://wavespeed.ai/ Context parallel attention that accelerates DiT model inference with dynamic caching
☆243Updated 3 weeks ago
svg-project / Sparse-VideoGen
Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
☆178Updated this week
feifeibear / long-context-attention
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference
☆477Updated this week
mit-han-lab / distrifuser
[CVPR 2024 Highlight] DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
☆677Updated 4 months ago
xuyang-liu16 / Awesome-Generation-Acceleration
📚 Collection of awesome generation acceleration resources.
☆215Updated this week
thu-nics / ViDiT-Q
[ICLR'25] ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation
☆77Updated last month
NVlabs / COAT
[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training
☆184Updated last week
TencentARC / mllm-npu
mllm-npu: training multimodal large language models on Ascend NPUs
☆91Updated 7 months ago
mit-han-lab / Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆270Updated 5 months ago
mit-han-lab / x-attention
XAttention: Block Sparse Attention with Antidiagonal Scoring
☆140Updated 3 weeks ago
Shenyi-Z / ToCa
Accelerating Diffusion Transformers with Token-wise Feature Caching
☆132Updated last month
microsoft / RAS
An open-source implementation of Regional Adaptive Sampling (RAS), a novel diffusion model sampling strategy that introduces regional var…
☆125Updated 2 months ago
66RING / tiny-flash-attention
flash attention tutorial written in python, triton, cuda, cutlass
☆334Updated 3 months ago
mit-han-lab / patch_conv
Patch convolution to avoid large GPU memory usage of Conv2D
☆86Updated 3 months ago
ByteDance-Seed / Triton-distributed
Distributed Triton for Parallel Systems
☆451Updated 2 weeks ago
SiriusNEO / Triton-Puzzles-Lite
Puzzles for learning Triton, play it with minimal environment configuration!
☆290Updated 4 months ago
PipeFusion / PipeFusion
A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters
☆44Updated 9 months ago
ByteDance-Seed / VeOmni
VeOmni: Scaling any Modality Model Training to any Accelerators with PyTorch native Training Framework
☆297Updated 2 weeks ago
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆368Updated last week
mit-han-lab / duo-attention
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
☆453Updated 2 months ago
Xiuyu-Li / q-diffusion
[ICCV 2023] Q-Diffusion: Quantizing Diffusion Models.
☆348Updated last year
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆116Updated 2 weeks ago
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆304Updated 9 months ago
InternLM / turbomind
☆82Updated 3 weeks ago