AntResearchNLP / ViLaSRLinks

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

☆79

Alternatives and similar repositories for ViLaSR

Users that are interested in ViLaSR are comparing it to the libraries listed below

Sorting:

multimodal-reasoning-lab / Bagel-Zebra-CoT
https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT
☆104Updated last month
EvolvingLMMs-Lab / LongVT
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
☆95Updated this week
yu-rp / VisualPerceptionToken
☆130Updated 8 months ago
TIGER-AI-Lab / Pixel-Reasoner
Pixel-Level Reasoning Model trained with RL [NeuIPS25]
☆251Updated 3 weeks ago
OuyangKun10 / SpaceR
SpaceR: The first MLLM empowered by SG-RLVR for video spatial reasoning
☆98Updated 4 months ago
UMass-Embodied-AGI / Mirage
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (arXiv 2025)
☆197Updated 4 months ago
yunlong10 / Awesome-Video-LMM-Post-Training
🔥🔥🔥 Latest Papers, Codes and Datasets on Video-LMM Post-Training
☆186Updated 2 weeks ago
Haochen-Wang409 / TreeVGR
Official implementation of "Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology"
☆71Updated last month
hmxiong / StreamChat
Official repo for "Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge" ICLR2025
☆86Updated 8 months ago
xinyan-cxy / MINT-CoT
[NeurIPS 2025] MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
☆88Updated 2 months ago
thuml / MiniVeo3-Reasoner
Thinking with Videos from Open-Source Priors. We reproduce chain-of-frames visual reasoning by fine-tuning open-source video models. Give…
☆186Updated last month
TencentARC / Video-Holmes
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
☆78Updated 4 months ago
Wakals / CoVT
☆112Updated this week
Cooperx521 / ScaleCap
Official repository of 'ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing’
☆58Updated 5 months ago
PzySeere / MetaSpatial
MetaSpatial leverages reinforcement learning to enhance 3D spatial reasoning in vision-language models (VLMs), enabling more structured, …
☆194Updated 7 months ago
tongjingqi / Thinking-with-Video
We introduce 'Thinking with Video', a new paradigm leveraging video generation for multimodal reasoning. Our VideoThinkBench shows that S…
☆212Updated last week
egolife-ai / Ego-R1
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
☆131Updated 3 months ago
zhishuifeiqian / VCR-Bench
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning
☆32Updated 4 months ago
TencentARC / SEED-Bench-R1
☆94Updated 5 months ago
Gabesarch / grounded-rl
☆107Updated 4 months ago
Haochen-Wang409 / ross
[ICLR'25] Reconstructive Visual Instruction Tuning
☆128Updated 7 months ago
zhijie-group / R1-Zero-VSI
☆41Updated 5 months ago
mll-lab-nu / TStar
TStar is a unified temporal search framework for long-form video question answering
☆73Updated 3 months ago
ls-kelvin / REVPT
Code for paper: Reinforced Vision Perception with Tools
☆62Updated 2 months ago
ZJU-REAL / ViewSpatial-Bench
ViewSpatial-Bench:Evaluating Multi-perspective Spatial Localization in Vision-Language Models
☆66Updated 6 months ago
Liuziyu77 / MIA-DPO
Official implement of MIA-DPO
☆67Updated 10 months ago
yale-nlp / MMVU
Data and Code for CVPR 2025 paper "MMVU: Measuring Expert-Level Multi-Discipline Video Understanding"
☆75Updated 9 months ago
Fr0zenCrane / UniCoT
Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
☆173Updated last week
PKU-YuanGroup / Look-Back
This repository is the official implementation of "Look-Back: Implicit Visual Re-focusing in MLLM Reasoning".
☆72Updated 4 months ago
appletea233 / Temporal-R1
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency
☆58Updated 5 months ago