SHI-Labs / IMG-Multimodal-Diffusion-Alignment
IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance, ICCV 2025
☆28 · Updated last week
Alternatives and similar repositories for IMG-Multimodal-Diffusion-Alignment
Users interested in IMG-Multimodal-Diffusion-Alignment are comparing it to the repositories listed below.
- [NeurIPS 2024] ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis ☆24 · Updated 10 months ago
- CODA: Repurposing Continuous VAEs for Discrete Tokenization ☆28 · Updated 3 months ago
- ☆52 · Updated last month
- Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (arXiv 2025) ☆168 · Updated 2 months ago
- [ECCV 2024] Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators ☆45 · Updated last year
- ☆88 · Updated 2 months ago
- [ECCV 2024] AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation ☆34 · Updated last year
- [ICCV 2025 Oral] Official implementation of Learning Streaming Video Representation via Multitask Training. ☆56 · Updated 2 weeks ago
- Official implementation of ECCV 2024 paper: Take A Step Back: Rethinking the Two Stages in Visual Reasoning ☆15 · Updated 4 months ago
- TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation ☆218 · Updated last month
- [NeurIPS 2025] Official Repo of Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration ☆82 · Updated 4 months ago
- [ICLR'25] Reconstructive Visual Instruction Tuning ☆119 · Updated 6 months ago
- [NeurIPS 2024] Official Repository of Multi-Object Hallucination in Vision-Language Models ☆31 · Updated 10 months ago
- ☆13 · Updated 9 months ago
- [NIPS 2025 DB Oral] Official Repository of paper: Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing ☆99 · Updated 3 weeks ago
- [CVPR 2025 (Oral)] Open implementation of "RandAR" ☆196 · Updated 2 months ago
- MetaSpatial leverages reinforcement learning to enhance 3D spatial reasoning in vision-language models (VLMs), enabling more structured, … ☆189 · Updated 5 months ago
- [ICML2025] The code and data of Paper: Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation ☆124 · Updated 11 months ago
- Code for MetaMorph Multimodal Understanding and Generation via Instruction Tuning ☆212 · Updated 5 months ago
- TStar is a unified temporal search framework for long-form video question answering ☆68 · Updated last month
- ☆17 · Updated 7 months ago
- Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning ☆122 · Updated last month
- A collection of vision foundation models unifying understanding and generation. ☆55 · Updated 9 months ago
- [NeurIPS 2025] VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models ☆67 · Updated this week
- ☆32 · Updated last month
- Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision ☆146 · Updated 2 weeks ago
- A PyTorch implementation of the paper "Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis" ☆46 · Updated last year
- ☆30 · Updated 10 months ago
- [CVPR 2025 Highlight] Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding ☆37 · Updated last month
- WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation ☆152 · Updated last week