zhengxuJosh / Awesome-Multimodal-Spatial-Reasoning
This repository collects and organises state‑of‑the‑art papers on spatial reasoning for Multimodal Vision–Language Models (MVLMs).
☆210 · Updated last week
Alternatives and similar repositories for Awesome-Multimodal-Spatial-Reasoning
Users who are interested in Awesome-Multimodal-Spatial-Reasoning are comparing it to the repositories listed below.
- NEO Series: Native Vision-Language Models from First Principles ☆223 · Updated 3 weeks ago
- ☆134 · Updated last week
- Official implementation of "Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs". ☆92 · Updated last week
- ☆173 · Updated 3 months ago
- A Scientific Multimodal Foundation Model ☆604 · Updated last month
- [ACL2025 Oral & Award] Evaluate Image/Video Generation like Humans - Fast, Explainable, Flexible ☆107 · Updated 3 months ago
- The official repo for "Parallel-R1: Towards Parallel Thinking via Reinforcement Learning" ☆231 · Updated this week
- D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI ☆53 · Updated 3 weeks ago
- OpenVLThinker: An Early Exploration to Vision-Language Reasoning via Iterative Self-Improvement ☆118 · Updated 3 months ago
- StreamingVLM: Real-Time Understanding for Infinite Video Streams ☆696 · Updated last month
- OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language. ☆510 · Updated 2 weeks ago
- A Curated List of Awesome Works in World Modeling, Aiming to Serve as a One-stop Resource for Researchers, Practitioners, and Enthusiasts… ☆782 · Updated this week
- ☆183 · Updated 5 months ago
- Visual Planning: Let's Think Only with Images ☆279 · Updated 5 months ago
- Code for R-Zero: Self-Evolving Reasoning LLM from Zero Data (https://www.arxiv.org/pdf/2508.05004) ☆667 · Updated 2 weeks ago
- A minimal implementation of DeepMind's Genie world model ☆1,027 · Updated last week
- Native Multimodal Models are World Learners ☆1,230 · Updated this week
- [ICCV 2025] LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion ☆289 · Updated 4 months ago
- Generate large-scale explorable 3D scenes with high-quality panorama videos from a single image or text prompt. ☆561 · Updated last month
- An official implementation of "CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning" ☆140 · Updated last week
- Cosmos-Transfer1-DiffusionRenderer: High-quality video de-lighting and re-lighting based on Cosmos video diffusion framework ☆747 · Updated last month
- ☆325 · Updated 3 months ago
- Feed-forward model for predicting 3D physics with 3DGS + NeRF ☆236 · Updated 2 months ago
- One-shot and Few-shot 3D Editing without Per-Scene Optimization ☆160 · Updated 2 months ago
- ☆53 · Updated 6 months ago
- This is the official Python version of Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play. ☆99 · Updated 3 weeks ago
- UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning ☆150 · Updated 5 months ago
- Official implementation of "PyVision: Agentic Vision with Dynamic Tooling." ☆133 · Updated 3 months ago
- Fully Open Framework for Democratized Multimodal Training ☆610 · Updated last week
- [NeurIPS 2025] Thinkless: LLM Learns When to Think ☆242 · Updated last month