aim-uofa / Active-o3
ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
☆68 · Updated 2 months ago
Alternatives and similar repositories for Active-o3
Users interested in Active-o3 are comparing it to the repositories listed below
- ☆41 · Updated 2 months ago
- Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (arXiv 2025) ☆115 · Updated last week
- ☆87 · Updated last month
- Visual Planning: Let's Think Only with Images ☆262 · Updated 2 months ago
- Official Repo of Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration ☆73 · Updated 2 months ago
- Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better ☆36 · Updated last month
- SpaceR: The first MLLM empowered by SG-RLVR for video spatial reasoning ☆71 · Updated last month
- Pixel-Level Reasoning Model trained with RL ☆187 · Updated last month
- Code and dataset link for "DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World" ☆96 · Updated last month
- The official repository for our paper, "Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning". ☆125 · Updated 3 weeks ago
- Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning ☆97 · Updated last month
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? ☆63 · Updated 3 weeks ago
- ☆62 · Updated this week
- MetaSpatial leverages reinforcement learning to enhance 3D spatial reasoning in vision-language models (VLMs), enabling more structured, … ☆162 · Updated 3 months ago
- [arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation ☆76 · Updated 5 months ago
- ☆47 · Updated 2 months ago
- Structured Video Comprehension of Real-World Shorts ☆132 · Updated this week
- [CVPR 25] A framework named B^2-DiffuRL for RL-based diffusion model fine-tuning. ☆34 · Updated 4 months ago
- [CVPR'2025] VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models". ☆180 · Updated last month
- ☆30 · Updated 8 months ago
- ☆194 · Updated this week
- Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing ☆55 · Updated last week
- Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces ☆77 · Updated 2 months ago
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment ☆54 · Updated 2 weeks ago
- Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision ☆66 · Updated this week
- [CVPR 2025] Official PyTorch Implementation of GLUS: Global-Local Reasoning Unified into A Single Large Language Model for Video Segmenta… ☆48 · Updated last month
- TStar is a unified temporal search framework for long-form video question answering ☆59 · Updated 4 months ago
- Multi-SpatialMLLM Multi-Frame Spatial Understanding with Multi-Modal Large Language Models ☆140 · Updated 2 months ago
- Implementation for "The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer" ☆58 · Updated 2 weeks ago
- [ICLR'25] Reconstructive Visual Instruction Tuning ☆101 · Updated 4 months ago