xlite-dev / Awesome-Diffusion-Inference
A curated list of Awesome Diffusion Inference Papers with code: Sampling, Caching, Multi-GPU, etc.
⭐ 210 · Updated last month
Alternatives and similar repositories for Awesome-Diffusion-Inference:
Users interested in Awesome-Diffusion-Inference are comparing it to the libraries listed below.
- Model Compression Toolbox for Large Language Models and Diffusion Models · ⭐ 435 · Updated 3 weeks ago
- ⭐ 160 · Updated 3 months ago
- SpargeAttention: A training-free sparse attention method that can accelerate inference for any model · ⭐ 488 · Updated this week
- A distributed attention mechanism targeting linear scalability for ultra-long-context, heterogeneous data training · ⭐ 248 · Updated this week
- A sparse attention kernel supporting mixed sparse patterns · ⭐ 197 · Updated 2 months ago
- A parallel VAE that avoids OOM for high-resolution image generation · ⭐ 61 · Updated 3 months ago
- Context-parallel attention that accelerates DiT model inference with dynamic caching (https://wavespeed.ai/) · ⭐ 243 · Updated 3 weeks ago
- Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity · ⭐ 178 · Updated this week
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference · ⭐ 477 · Updated this week
- [CVPR 2024 Highlight] DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models · ⭐ 677 · Updated 4 months ago
- A collection of awesome generation-acceleration resources · ⭐ 215 · Updated this week
- [ICLR'25] ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation · ⭐ 77 · Updated last month
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training · ⭐ 184 · Updated last week
- mllm-npu: training multimodal large language models on Ascend NPUs · ⭐ 91 · Updated 7 months ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference · ⭐ 270 · Updated 5 months ago
- XAttention: Block Sparse Attention with Antidiagonal Scoring · ⭐ 140 · Updated 3 weeks ago
- Accelerating Diffusion Transformers with Token-wise Feature Caching · ⭐ 132 · Updated last month
- An open-source implementation of Regional Adaptive Sampling (RAS), a novel diffusion model sampling strategy that introduces regional var… · ⭐ 125 · Updated 2 months ago
- A flash attention tutorial written in Python, Triton, CUDA, and CUTLASS · ⭐ 334 · Updated 3 months ago
- Patch convolution that avoids the large GPU memory usage of Conv2D · ⭐ 86 · Updated 3 months ago
- Distributed Triton for Parallel Systems · ⭐ 451 · Updated 2 weeks ago
- Puzzles for learning Triton that you can play with minimal environment configuration! · ⭐ 290 · Updated 4 months ago
- A suite for parallel inference of Diffusion Transformers (DiTs) on multi-GPU clusters · ⭐ 44 · Updated 9 months ago
- VeOmni: Scaling any-modality model training to any accelerator with a PyTorch-native training framework · ⭐ 297 · Updated 2 weeks ago
- Efficient LLM Inference over Long Sequences · ⭐ 368 · Updated last week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads · ⭐ 453 · Updated 2 months ago
- [ICCV 2023] Q-Diffusion: Quantizing Diffusion Models · ⭐ 348 · Updated last year
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs · ⭐ 116 · Updated 2 weeks ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving · ⭐ 304 · Updated 9 months ago
- ⭐ 82 · Updated 3 weeks ago