vision-x-nyu / thinking-in-space
Official repo and evaluation implementation of VSI-Bench
⭐ 481 · Updated 2 months ago
Alternatives and similar repositories for thinking-in-space
Users interested in thinking-in-space are comparing it to the repositories listed below
- [NeurIPS'24] This repository is the implementation of "SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models" · ⭐ 195 · Updated 5 months ago
- Compose multimodal datasets · ⭐ 371 · Updated 3 weeks ago
- Video-R1: Reinforcing Video Reasoning in MLLMs [🔥 the first paper to explore R1 for video] · ⭐ 515 · Updated this week
- [ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation · ⭐ 315 · Updated 2 weeks ago
- [ICML 2024] Official code repository for 3D embodied generalist agent LEO · ⭐ 437 · Updated 3 weeks ago
- ⭐ 381 · Updated last year
- MetaSpatial leverages reinforcement learning to enhance 3D spatial reasoning in vision-language models (VLMs), enabling more structured, … · ⭐ 114 · Updated last week
- A Simple yet Effective Pathway to Empowering LLaVA to Understand and Interact with 3D World · ⭐ 252 · Updated 5 months ago
- The official repo for "SpatialBot: Precise Spatial Understanding with Vision Language Models." · ⭐ 253 · Updated 3 months ago
- Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey · ⭐ 568 · Updated this week
- [NeurIPS 2023 Datasets and Benchmarks Track] LAMM: Multi-Modal Large Language Models and Applications as AI Agents · ⭐ 312 · Updated last year
- [Survey] Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey · ⭐ 439 · Updated 3 months ago
- MM-EUREKA: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning · ⭐ 597 · Updated last week
- [NeurIPS'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought … · ⭐ 311 · Updated 4 months ago
- Cosmos-Reason1 models understand the physical common sense and generate appropriate embodied decisions in natural language through long c… · ⭐ 317 · Updated last month
- Long Context Transfer from Language to Vision · ⭐ 374 · Updated last month
- [CVPR 2024 & NeurIPS 2024] EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI · ⭐ 591 · Updated 2 months ago
- Official repository for VisionZip (CVPR 2025) · ⭐ 278 · Updated 2 months ago
- [ICLR 2024 Spotlight] DreamLLM: Synergistic Multimodal Comprehension and Creation · ⭐ 439 · Updated 5 months ago
- 🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio). · ⭐ 473 · Updated last month
- Official repository of "GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing" · ⭐ 239 · Updated 2 weeks ago
- Project Page For "Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement" · ⭐ 341 · Updated last month
- OpenEQA: Embodied Question Answering in the Era of Foundation Models · ⭐ 281 · Updated 7 months ago
- Code for "Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers" (NeurIPS 2024) · ⭐ 164 · Updated last month
- ✨ First Open-Source R1-like Video-LLM [2025/02/18] · ⭐ 335 · Updated 2 months ago
- This is the first paper to explore how to effectively use RL for MLLMs and introduce Vision-R1, a reasoning MLLM that leverages cold-sta… · ⭐ 559 · Updated last week
- [CVPR 2025] EgoLife: Towards Egocentric Life Assistant · ⭐ 278 · Updated last month
- Official code implementation of Perception R1: Pioneering Perception Policy with Reinforcement Learning · ⭐ 167 · Updated 3 weeks ago
- This is a repository for organizing papers, codes and other resources related to unified multimodal models. · ⭐ 538 · Updated last month
- EVE Series: Encoder-Free Vision-Language Models from BAAI · ⭐ 326 · Updated 2 months ago