DefTruth / Awesome-Diffusion-Inference

📖A curated list of Awesome Diffusion Inference Papers with codes: Sampling, Caching, Multi-GPUs, etc. 🎉🎉

☆189

Alternatives and similar repositories for Awesome-Diffusion-Inference:

Users who are interested in Awesome-Diffusion-Inference are comparing it to the libraries listed below:

mit-han-lab / deepcompressor
Model Compression Toolbox for Large Language Models and Diffusion Models
☆330 Updated this week
thu-nics / DiTFastAttn
☆144 Updated last month
xdit-project / DistVAE
A parallel VAE that avoids OOM for high-resolution image generation
☆53 Updated 3 weeks ago
thu-nics / ViDiT-Q
[ICLR'25] ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation
☆55 Updated last week
66RING / tiny-flash-attention
Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS
☆260 Updated last month
feifeibear / long-context-attention
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference
☆428 Updated this week
TencentARC / mllm-npu
mllm-npu: training multimodal large language models on Ascend NPUs
☆90 Updated 5 months ago
mit-han-lab / Block-Sparse-Attention
A sparse attention kernel supporting mixed sparse patterns
☆133 Updated last week
bytedance / flux
A fast communication-overlapping library for tensor parallelism on GPUs.
☆296 Updated 3 months ago
SiriusNEO / Triton-Puzzles-Lite
Puzzles for learning Triton, play it with minimal environment configuration!
☆229 Updated 2 months ago
DefTruth / ffpa-attn-mma
📚FFPA: Yet another Faster Flash Prefill Attention with O(1)⚡️SRAM complexity for headdim > 256, 1.8x~3x↑🎉faster than SDPA EA.
☆106 Updated this week
mit-han-lab / distrifuser
[CVPR 2024 Highlight] DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
☆654 Updated 2 months ago
chengzeyi / ParaAttention
Context parallel attention that accelerates DiT model inference with dynamic caching
☆189 Updated this week
mit-han-lab / Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆245 Updated 2 months ago
hahnyuan / LLM-Viewer
Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod…
☆392 Updated 5 months ago
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆101 Updated this week
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆512 Updated this week
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆290 Updated this week
xuyang-liu16 / Awesome-Generation-Acceleration
📚 Collection of awesome generation acceleration resources.
☆139 Updated this week
FlagOpen / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆240 Updated 8 months ago
spcl / QuaRot
Code for the NeurIPS'24 paper: QuaRot, an end-to-end 4-bit inference of large language models.
☆342 Updated 2 months ago
FMInference / H2O
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
☆421 Updated 6 months ago
mit-han-lab / nunchaku
[ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
☆679 Updated this week
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆295 Updated 7 months ago
Xiuyu-Li / q-diffusion
[ICCV 2023] Q-Diffusion: Quantizing Diffusion Models.
☆343 Updated 11 months ago
SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆116 Updated 11 months ago
xdit-project / xDiT
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
☆1,276 Updated last week
FlagOpen / FlagGems
FlagGems is an operator library for large language models implemented in Triton Language.
☆421 Updated this week
galeselee / Awesome_LLM_System-PaperList
Since the emergence of ChatGPT in 2022, the acceleration of Large Language Models has become increasingly important. Here is a list of pap…
☆220 Updated 2 months ago
usyd-fsalab / fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
☆234 Updated 3 months ago