ShoufaChen / Awesome-Diffusion-Transformers
https://www.shoufachen.com/Awesome-Diffusion-Transformers/
☆106 Updated 6 months ago
Related projects:
- This is a repository for organizing papers, codes and other resources related to unified multimodal models. ☆134 Updated last week
- VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models ☆93 Updated last month
- A list for Text-to-Video, Image-to-Video works ☆167 Updated last month
- Scaling Diffusion Transformers with Mixture of Experts ☆178 Updated last week
- ☆147 Updated last year
- STAR: Scale-wise Text-to-image generation via Auto-Regressive representations ☆107 Updated 3 months ago
- Scaling RWKV-Like Architectures for Diffusion Models ☆110 Updated 5 months ago
- [ICLR 2024 Spotlight] DreamLLM: Synergistic Multimodal Comprehension and Creation ☆378 Updated 5 months ago
- ☆168 Updated 2 months ago
- CV-VAE: A Compatible Video VAE for Latent Generative Video Models ☆210 Updated 2 weeks ago
- UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing ☆87 Updated 5 months ago
- ☆99 Updated 6 months ago
- ☆113 Updated 2 months ago
- [ICML 2024 Spotlight] FiT: Flexible Vision Transformer for Diffusion Model ☆357 Updated 7 months ago
- 🔥 [CVPR2024] Official implementation of "Self-correcting LLM-controlled Diffusion Models (SLD)" ☆146 Updated 5 months ago
- [ICLR2024] The official implementation of paper "VDT: General-purpose Video Diffusion Transformers via Mask Modeling", by Haoyu Lu, Guoxi… ☆205 Updated 4 months ago
- Precision Search through Multi-Style Inputs ☆45 Updated last month
- An in-context conditioning version of MUSE with pre-trained checkpoints. ☆105 Updated last year
- 🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio). ☆293 Updated 3 weeks ago
- My implementation of "Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution" ☆168 Updated last week
- Repository for diffusion model algorithm fundamentals: documentation, training, experiments, deployment, and more ☆26 Updated 3 months ago
- EVE: Encoder-Free Vision-Language Models ☆207 Updated 2 months ago
- Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling" ☆103 Updated last month
- This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams" ☆105 Updated last month
- Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions ☆125 Updated last month
- The HD-VG-130M Dataset ☆106 Updated 5 months ago
- ☆93 Updated 2 months ago
- 🔥 stable, simple, state-of-the-art VQVAE toolkit & cookbook ☆34 Updated 2 months ago
- SpeeD: A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training ☆148 Updated 2 months ago
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer ☆189 Updated 5 months ago