vision-x-nyu / thinking-in-space
Official repo and evaluation implementation of VSI-Bench
☆410 · Updated 3 weeks ago
Alternatives and similar repositories for thinking-in-space:
Users who are interested in thinking-in-space compare it to the repositories listed below.
- Compose multimodal datasets 🎹 ☆309 · Updated this week
- Official repository for VisionZip (CVPR 2025) ☆256 · Updated 3 weeks ago
- A Simple yet Effective Pathway to Empowering LLaVA to Understand and Interact with 3D World ☆226 · Updated 3 months ago
- [Survey] Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey ☆393 · Updated 2 months ago
- ☆369 · Updated 10 months ago
- [NeurIPS'24] This repository is the implementation of "SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models" ☆142 · Updated 3 months ago
- This is the official code of VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding (ECCV 2024) ☆181 · Updated 3 months ago
- [ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation ☆246 · Updated last month
- [NeurIPS 2023 Datasets and Benchmarks Track] LAMM: Multi-Modal Large Language Models and Applications as AI Agents ☆308 · Updated 11 months ago
- [ICML 2024] Official code repository for 3D embodied generalist agent LEO ☆417 · Updated 2 months ago
- 📖 This is a repository for organizing papers, codes and other resources related to unified multimodal models. ☆415 · Updated last week
- [ICLR 2024 Spotlight] DreamLLM: Synergistic Multimodal Comprehension and Creation ☆426 · Updated 3 months ago
- Long Context Transfer from Language to Vision ☆368 · Updated this week
- [NeurIPS'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought … ☆274 · Updated 2 months ago
- The official repo for "SpatialBot: Precise Spatial Understanding with Vision Language Models." ☆219 · Updated last month
- 🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio). ☆440 · Updated this week
- [CVPR 2025] 🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation". ☆289 · Updated 2 weeks ago
- [TMLR 2025🔥] A survey for the autoregressive models in vision. ☆443 · Updated this week
- [CVPR 2024 & NeurIPS 2024] EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI ☆563 · Updated 3 weeks ago
- Official implementation of ECCV24 paper "SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding" ☆233 · Updated this week
- This is the official implementation of "Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams" ☆171 · Updated 2 months ago
- ☆156 · Updated 3 weeks ago