vision-x-nyu / thinking-in-space

Official repo and evaluation implementation of VSI-Bench

☆560

Alternatives and similar repositories for thinking-in-space

Users who are interested in thinking-in-space are comparing it to the repositories listed below.


remyxai / VQASynth
Compose multimodal datasets 🎹
☆451 · Updated last week
AnjieCheng / SpatialRGPT
[NeurIPS'24] This repository is the implementation of "SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models"
☆226 · Updated 7 months ago
tulerfeng / Video-R1
Video-R1: Reinforcing Video Reasoning in MLLMs [🔥the first paper to explore R1 for video]
☆646 · Updated last week
PzySeere / MetaSpatial
MetaSpatial leverages reinforcement learning to enhance 3D spatial reasoning in vision-language models (VLMs), enabling more structured, …
☆162 · Updated 3 months ago
embodied-generalist / embodied-generalist
[ICML 2024] Official code repository for 3D embodied generalist agent LEO
☆448 · Updated 3 months ago
GigaAI-research / General-World-Models-Survey
☆414 · Updated last year
mit-han-lab / vila-u
[ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
☆374 · Updated 3 months ago
yyyybq / Awesome-Spatial-Reasoning
A paper list for spatial reasoning
☆127 · Updated last month
diankun-wu / Spatial-MLLM
Official implementation of Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
☆311 · Updated last month
NVlabs / Long-RL
Long-RL: Scaling RL to Long Sequences
☆568 · Updated this week
ZCMax / LLaVA-3D
[ICCV 2025] A Simple yet Effective Pathway to Empowering LLaVA to Understand and Interact with 3D World
☆296 · Updated 3 weeks ago
InternRobotics / EmbodiedScan
[CVPR 2024 & NeurIPS 2024] EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
☆614 · Updated last month
alibaba-damo-academy / WorldVLA
WorldVLA: Towards Autoregressive Action World Model
☆310 · Updated last month
facebookresearch / open-eqa
OpenEQA: Embodied Question Answering in the Era of Foundation Models
☆306 · Updated 10 months ago
BAAI-DCAI / SpatialBot
The official repo for "SpatialBot: Precise Spatial Understanding with Vision Language Models".
☆285 · Updated 2 months ago
yaotingwangofficial / Awesome-MCoT
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
☆735 · Updated 3 weeks ago
YueFan1014 / VideoAgent
This is the official code of VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding (ECCV 2024)
☆228 · Updated 8 months ago
LMM101 / Awesome-Multimodal-Next-Token-Prediction
[Survey] Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
☆446 · Updated 6 months ago
OuyangKun10 / SpaceR
SpaceR: The first MLLM empowered by SG-RLVR for video spatial reasoning
☆71 · Updated 3 weeks ago
nvidia-cosmos / cosmos-reason1
Cosmos-Reason1 models understand the physical common sense and generate appropriate embodied decisions in natural language through long c…
☆581 · Updated this week
ZzZZCHS / Chat-Scene
Code for "Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers" (NeurIPS 2024)
☆177 · Updated 4 months ago
leofan90 / Awesome-World-Models
A comprehensive list of papers for the definition of World Models and using World Models for General Video Generation, Embodied AI, and A…
☆266 · Updated this week
deepcs233 / Visual-CoT
[NeurIPS'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought …
☆360 · Updated 7 months ago
tulerfeng / Awesome-Embodied-Multimodal-LLMs
Latest Advances on Embodied Multimodal LLMs (or Vision-Language-Action Models).
☆117 · Updated last year
LaVi-Lab / Video-3D-LLM
[CVPR 2025] The code for the paper "Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding".
☆140 · Updated 2 months ago
EvolvingLMMs-Lab / EgoLife
[CVPR 2025] EgoLife: Towards Egocentric Life Assistant
☆313 · Updated 4 months ago
dvlab-research / Seg-Zero
Project page for "Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement"
☆480 · Updated last week
scene-verse / SceneVerse
Official implementation of ECCV24 paper "SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding"
☆250 · Updated 4 months ago
zwq2018 / embodied_reasoner
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
☆156 · Updated 2 months ago
tanhuajie / Reason-RFT
⭐️ Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning.
☆184 · Updated last week