Compose multimodal datasets 🎹
★548 · Jan 5, 2026 · Updated 2 months ago
Alternatives and similar repositories for VQASynth
Users interested in VQASynth are comparing it to the libraries listed below.
- [NeurIPS'24] This repository is the implementation of "SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models" ★313 · Dec 14, 2024 · Updated last year
- Official repo and evaluation implementation of VSI-Bench ★675 · Aug 5, 2025 · Updated 7 months ago
- The official repo for "SpatialBot: Precise Spatial Understanding with Vision Language Models." ★336 · Sep 14, 2025 · Updated 5 months ago
- Awesome-LLM-3D: a curated list of Multi-modal Large Language Model resources in the 3D world ★2,120 · Feb 3, 2026 · Updated last month
- [NeurIPS'24] SpatialEval: a benchmark to evaluate spatial reasoning abilities of MLLMs and LLMs ★59 · Jan 23, 2025 · Updated last year
- Official repository of Learning to Act from Actionless Videos through Dense Correspondences. ★249 · Apr 25, 2024 · Updated last year
- Code for 3D-LLM: Injecting the 3D World into Large Language Models ★1,183 · Jun 6, 2024 · Updated last year
- A Vision-Language Model for Spatial Affordance Prediction in Robotics ★214 · Jul 17, 2025 · Updated 7 months ago
- Dreamitate: Real-World Visuomotor Policy Learning via Video Generation (CoRL 2024) ★58 · Jun 7, 2025 · Updated 9 months ago
- Code of 3DMIT: 3D Multi-Modal Instruction Tuning for Scene Understanding ★31 · Jul 26, 2024 · Updated last year
- [NeurIPS 2025] Official implementation of Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence ★444 · Feb 5, 2026 · Updated last month
- ★443 · Nov 29, 2025 · Updated 3 months ago
- ★12 · Jan 10, 2025 · Updated last year
- [ICCV 2025] A Simple yet Effective Pathway to Empowering LLaVA to Understand and Interact with 3D World ★373 · Oct 21, 2025 · Updated 4 months ago
- Embodied Reasoning Question Answer (ERQA) Benchmark ★262 · Mar 12, 2025 · Updated 11 months ago
- [ICLR 2026] OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models ★80 · Jan 21, 2026 · Updated last month
- [TACL'23] VSR: A probing benchmark for spatial understanding of vision-language models. ★140 · Mar 25, 2023 · Updated 2 years ago
- Evaluating and reproducing real-world robot manipulation policies (e.g., RT-1, RT-1-X, Octo) in simulation under common setups (e.g., Goo… ★991 · Dec 20, 2025 · Updated 2 months ago
- Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs ★65 · Jan 1, 2026 · Updated 2 months ago
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models ★785 · Feb 20, 2025 · Updated last year
- [ICML 2024] 3D-VLA: A 3D Vision-Language-Action Generative World Model ★623 · Oct 29, 2024 · Updated last year
- A fork to add multimodal model training to open-r1 ★1,496 · Feb 8, 2025 · Updated last year
- OpenEQA: Embodied Question Answering in the Era of Foundation Models ★341 · Sep 20, 2024 · Updated last year
- VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and clou… ★3,771 · Nov 28, 2025 · Updated 3 months ago
- ★152 · Aug 23, 2023 · Updated 2 years ago
- ★78 · May 23, 2025 · Updated 9 months ago
- [NeurIPS 2025] SpatialLM: Training Large Language Models for Structured Indoor Modeling ★4,261 · Sep 26, 2025 · Updated 5 months ago
- ★4,582 · Sep 14, 2025 · Updated 5 months ago
- [NeurIPS 2024] Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding ★100 · Feb 2, 2025 · Updated last year
- [CVPR 2024] The code for the paper 'Towards Learning a Generalist Model for Embodied Navigation' ★229 · Jun 18, 2024 · Updated last year
- [ICML 2024] LEO: An Embodied Generalist Agent in 3D World ★477 · Apr 20, 2025 · Updated 10 months ago
- OpenVLA: An open-source vision-language-action model for robotic manipulation. ★5,383 · Mar 23, 2025 · Updated 11 months ago
- MetaSpatial leverages reinforcement learning to enhance 3D spatial reasoning in vision-language models (VLMs), enabling more structured, … ★205 · May 5, 2025 · Updated 10 months ago
- Code for "Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation" ★301 · Apr 22, 2024 · Updated last year
- [ICLR 2025 Oral] Official Implementation for "Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Un… ★21 · Oct 24, 2024 · Updated last year
- A Curated List of Vision-Language-Action (VLA) and World Action Models (WAM) Research and Beyond ★100 · Updated this week
- ✨ First Open-Source R1-like Video-LLM [2025/02/18] ★382 · Feb 23, 2025 · Updated last year
- Solve Visual Understanding with Reinforced VLMs ★5,855 · Oct 21, 2025 · Updated 4 months ago
- Cambrian-1 is a family of multimodal LLMs with a vision-centric design. ★1,988 · Nov 7, 2025 · Updated 4 months ago