mini-sora / MiniSora-DiT
MiniSora-DiT, a DiT reproduction based on XTuner from the open-source community MiniSora
☆39Updated 10 months ago
Alternatives and similar repositories for MiniSora-DiT:
Users who are interested in MiniSora-DiT are comparing it to the libraries listed below
- [ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark☆66Updated last week
- A lightweight and highly efficient training framework for accelerating diffusion tasks.☆45Updated 4 months ago
- LLaVA combined with the Magvit image tokenizer, training an MLLM without a Vision Encoder. Unifying image understanding and generation.☆35Updated 7 months ago
- The official implementation of PAR: Parallelized Autoregressive Visual Generation. https://epiphqny.github.io/PAR-project/☆108Updated 3 weeks ago
- ☆47Updated last month
- Video dataset dedicated to portrait-mode video recognition.☆43Updated last month
- A Framework for Decoupling and Assessing the Capabilities of VLMs☆40Updated 7 months ago
- Implementation of SmoothCache, a project aimed at speeding up Diffusion Transformer (DiT) based GenAI models with error-guided caching.☆37Updated this week
- T2VScore: Towards A Better Metric for Text-to-Video Generation☆78Updated 9 months ago
- Code Release for the paper "Make-A-Story: Visual Memory Conditioned Consistent Story Generation" in CVPR 2023☆37Updated last year
- ☆44Updated last month
- ☆28Updated last week
- Adaptive Caching for Faster Video Generation with Diffusion Transformers☆139Updated 2 months ago
- [NeurIPS 2024 D&B Track] Official Repo for "LVD-2M: A Long-take Video Dataset with Temporally Dense Captions"☆45Updated 3 months ago
- Multimodal Open-O1 (MO1) is designed to enhance the accuracy of inference models by utilizing a novel prompt-based approach. This tool wo…☆29Updated 4 months ago
- ☆55Updated last month
- ☆133Updated 2 weeks ago
- [NeurIPS 2024 D&B Track] Implementation for "FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models"☆66Updated last month
- [NeurIPS 2024] Efficient Multi-modal Models via Stage-wise Visual Context Compression☆51Updated 5 months ago
- ☆100Updated 7 months ago
- TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation☆29Updated 2 months ago
- Blending Custom Photos with Video Diffusion Transformers☆40Updated last week
- Code release for our paper "Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation".☆18Updated last year
- The HD-VG-130M Dataset☆114Updated 9 months ago
- Explore the Limits of Omni-modal Pretraining at Scale☆96Updated 4 months ago
- ☆19Updated last year
- ☆40Updated 6 months ago
- Official implementation of MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis☆83Updated 6 months ago
- Inference-only implementation of "One-Step Diffusion Distillation through Score Implicit Matching" [NIPS 2024]☆77Updated 2 months ago
- Implementation of Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding☆28Updated 2 months ago