MINT-SJTU / STI-Bench
STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
☆22Updated last week
Alternatives and similar repositories for STI-Bench
Users that are interested in STI-Bench are comparing it to the libraries listed below
- ☆37Updated last month
- SpaceR: The first MLLM empowered by SG-RLVR for video spatial reasoning☆69Updated last week
- [CVPR'24 Highlight] The official code and data for paper "EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Lan…☆60Updated 3 months ago
- Awesome paper for multi-modal llm with grounding ability☆18Updated 11 months ago
- [ICLR 2023] CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding☆45Updated last month
- ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration☆46Updated 6 months ago
- Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces☆75Updated last month
- [NeurIPS'24] SpatialEval: a benchmark to evaluate spatial reasoning abilities of MLLMs and LLMs☆45Updated 5 months ago
- ☆87Updated 3 weeks ago
- ☆63Updated this week
- [NeurIPS-24] This is the official implementation of the paper "DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effect…☆37Updated last year
- Official repository of DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models☆85Updated 10 months ago
- ☆17Updated 2 months ago
- ☆45Updated 6 months ago
- Repo for paper "T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs"☆49Updated 4 months ago
- [NeurIPS 2024] Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning☆69Updated 5 months ago
- Official repo for EscapeCraft (a 3D environment for room escape) and benchmark MM-Escape. This work is accepted by ICCV 2025.☆27Updated last week
- [NeurIPS 2024] Efficient Large Multi-modal Models via Visual Context Compression☆60Updated 4 months ago
- [NeurIPS-2024] The official Implementation of "Instruction-Guided Visual Masking"☆35Updated 8 months ago
- ☆70Updated 7 months ago
- The official repository for our paper, "Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning".☆95Updated this week
- Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (arXiv 2025)☆102Updated last week
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?☆60Updated this week
- [CVPR'25] 🌟🌟 EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering☆34Updated 3 weeks ago
- ACL'24 (Oral) Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback☆67Updated 10 months ago
- Open-vocabulary Semantic Segmentation☆33Updated last year
- Repository of paper: Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models☆37Updated last year
- Official repo for "Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge" ICLR2025☆59Updated 4 months ago
- [CVPR 2024] Data and benchmark code for the EgoExoLearn dataset☆62Updated 10 months ago
- Improving 3D Large Language Model via Robust Instruction Tuning☆60Updated 4 months ago