SHI-Labs / IMG-Multimodal-Diffusion-Alignment
IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance, ICCV 2025
☆30 Updated 4 months ago
Alternatives and similar repositories for IMG-Multimodal-Diffusion-Alignment
Users who are interested in IMG-Multimodal-Diffusion-Alignment are comparing it to the libraries listed below
- Official repository of Vision Test-Time Training ☆49 Updated 2 months ago
- [NeurIPS 2024] ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis ☆25 Updated last year
- ☆117 Updated 6 months ago
- ☆122 Updated 3 months ago
- Holistic Evaluation of Multimodal LLMs on Spatial Intelligence ☆79 Updated this week
- Official repository for "iVideoGPT: Interactive VideoGPTs are Scalable World Models" (NeurIPS 2024), https://arxiv.org/abs/2405.15223 ☆164 Updated 4 months ago
- Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (arXiv 2025) ☆240 Updated 6 months ago
- MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence ☆54 Updated last month
- Thinking with Videos from Open-Source Priors. We reproduce chain-of-frames visual reasoning by fine-tuning open-source video models. Give… ☆207 Updated 3 months ago
- [ICML'25] The PyTorch implementation of paper: "AdaWorld: Learning Adaptable World Models with Latent Actions". ☆196 Updated 7 months ago
- MetaSpatial leverages reinforcement learning to enhance 3D spatial reasoning in vision-language models (VLMs), enabling more structured, … ☆203 Updated 9 months ago
- Official repository for "Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models", https://arxiv.org/abs/2601.1983… ☆64 Updated last week
- [ICLR'26] Official PyTorch implementation of "Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models". ☆59 Updated this week
- [ICLR 2026] MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence ☆76 Updated last week
- CODA: Repurposing Continuous VAEs for Discrete Tokenization ☆35 Updated 7 months ago
- [ICLR 2025] Official implementation and benchmark evaluation repository of <PhysBench: Benchmarking and Enhancing Vision-Language Models … ☆83 Updated 2 weeks ago
- The open-source code for the NeurIPS 2025 paper, "Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learn… ☆40 Updated last month
- ☆163 Updated last year
- We introduce 'Thinking with Video', a new paradigm leveraging video generation for multimodal reasoning. Our VideoThinkBench shows that S… ☆237 Updated last week
- Official repository for "RLVR-World: Training World Models with Reinforcement Learning" (NeurIPS 2025), https://arxiv.org/abs/2505.13934 ☆208 Updated 3 months ago
- [ICCV2025 Oral] Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos ☆162 Updated 4 months ago
- [ECCV 2024] AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation ☆35 Updated last year
- [NeurIPS 2025] OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding ☆70 Updated 4 months ago
- PyTorch implementation of NEPA ☆308 Updated 2 weeks ago
- [World-Model-Survey-2024] Paper list and projects for World Model ☆15 Updated last year
- ☆184 Updated last week
- Cambrian-S: Towards Spatial Supersensing in Video ☆488 Updated last month
- [NeurIPS 2025] Official Repo of Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration ☆113 Updated 2 months ago
- [ICCV'25] Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness ☆64 Updated 6 months ago
- [NeurIPS-2024] The official implementation of "Instruction-Guided Visual Masking" ☆40 Updated last year