SHI-Labs / IMG-Multimodal-Diffusion-Alignment
IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance, ICCV 2025
☆30 · Updated 3 months ago
Alternatives and similar repositories for IMG-Multimodal-Diffusion-Alignment
Users interested in IMG-Multimodal-Diffusion-Alignment are comparing it to the repositories listed below.
- Official repository of Vision Test-Time Training ☆48 · Updated last month
- [NeurIPS 2024] ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis ☆25 · Updated last year
- Official repository for "iVideoGPT: Interactive VideoGPTs are Scalable World Models" (NeurIPS 2024), https://arxiv.org/abs/2405.15223 ☆162 · Updated 3 months ago
- [ICML'25] PyTorch implementation of the paper "AdaWorld: Learning Adaptable World Models with Latent Actions" ☆190 · Updated 7 months ago
- MetaSpatial leverages reinforcement learning to enhance 3D spatial reasoning in vision-language models (VLMs), enabling more structured, … ☆200 · Updated 8 months ago
- ☆116 · Updated 2 months ago
- ☆113 · Updated 5 months ago
- Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (arXiv 2025) ☆231 · Updated 5 months ago
- MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence ☆51 · Updated last week
- Official eval code for ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation ☆27 · Updated last month
- Thinking with Videos from Open-Source Priors. We reproduce chain-of-frames visual reasoning by fine-tuning open-source video models. Give… ☆203 · Updated 3 months ago
- CODA: Repurposing Continuous VAEs for Discrete Tokenization ☆35 · Updated 6 months ago
- [ICLR 2025] Official implementation and benchmark evaluation repository of <PhysBench: Benchmarking and Enhancing Vision-Language Models … ☆83 · Updated 7 months ago
- ☆58 · Updated 4 months ago
- Cambrian-S: Towards Spatial Supersensing in Video ☆475 · Updated 3 weeks ago
- [Nature Machine Intelligence 2025] Emulating Human-like Adaptive Vision for Efficient and Flexible Machine Visual Perception ☆125 · Updated last month
- ☆162 · Updated last year
- [arXiv 2025] MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence ☆70 · Updated this week
- [NeurIPS 2025] OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding ☆69 · Updated 3 months ago
- The open-source code for the NeurIPS 2025 paper, "Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learn… ☆32 · Updated 2 weeks ago
- (NeurIPS 2025 D&B Track) OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps ☆23 · Updated 2 months ago
- Official repository for "Vid2World: Crafting Video Diffusion Models to Interactive World Models", https://arxiv.org/abs/2505.14357 ☆26 · Updated 3 weeks ago
- Thinking in 360°: Humanoid Visual Search in the Wild ☆107 · Updated last month
- [ICML2025] Code and data for the paper "Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation" ☆142 · Updated last year
- Implementation of VLM4VLA ☆33 · Updated last week
- Holistic Evaluation of Multimodal LLMs on Spatial Intelligence ☆60 · Updated last week
- Official repository for "RLVR-World: Training World Models with Reinforcement Learning" (NeurIPS 2025), https://arxiv.org/abs/2505.13934 ☆188 · Updated 2 months ago
- We introduce 'Thinking with Video', a new paradigm leveraging video generation for multimodal reasoning. Our VideoThinkBench shows that S… ☆235 · Updated last week
- [ICLR 2025 Spotlight] Grounding Video Models to Actions through Goal Conditioned Exploration ☆59 · Updated 8 months ago
- Official implementation of "Self-Improving Video Generation" ☆76 · Updated 8 months ago