hughplay / Visual-Reasoning-Papers
A curated list of visual reasoning papers.
☆31 · Updated 3 months ago
Alternatives and similar repositories for Visual-Reasoning-Papers
Users interested in Visual-Reasoning-Papers are comparing it to the repositories listed below.
- [Arxiv] Aligning Modalities in Vision Large Language Models via Preference Fine-tuning · ☆90 · Updated last year
- [TACL'23] VSR: A probing benchmark for spatial understanding of vision-language models. · ☆139 · Updated 2 years ago
- Official repository for the A-OKVQA dataset · ☆109 · Updated last year
- [ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning · ☆296 · Updated last year
- [ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision · ☆72 · Updated last year
- ☆231 · Updated 2 years ago
- [COLM-2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs · ☆145 · Updated last year
- Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR … · ☆291 · Updated 2 years ago
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension. · ☆69 · Updated last year
- [ACL 2024] PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain · ☆106 · Updated last year
- [ICLR 2024] Analyzing and Mitigating Object Hallucination in Large Vision-Language Models · ☆155 · Updated last year
- ☆67 · Updated 2 years ago
- [NeurIPS 2024] A task generation and model evaluation system for multimodal language models. · ☆73 · Updated last year
- Code for our ACL 2025 paper "Language Repository for Long Video Understanding" · ☆34 · Updated last year
- ☆117 · Updated 6 months ago
- Code and datasets for "What's 'up' with vision-language models? Investigating their struggle with spatial reasoning". · ☆70 · Updated last year
- Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization · ☆100 · Updated 2 years ago
- ☆100 · Updated last year
- Code for the paper "AutoPresent: Designing Structured Visuals From Scratch" (CVPR 2025) · ☆154 · Updated 8 months ago
- This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.or… · ☆159 · Updated 4 months ago
- [ICLR 2024 (Spotlight)] "Frozen Transformers in Language Models are Effective Visual Encoder Layers" · ☆247 · Updated 2 years ago
- Source code for the paper "Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models" · ☆18 · Updated last week
- ☆155 · Updated last year
- [EMNLP'23] The official GitHub page for "Evaluating Object Hallucination in Large Vision-Language Models" · ☆105 · Updated 5 months ago
- Reinforcement Learning of Vision Language Models with Self Visual Perception Reward · ☆160 · Updated 4 months ago
- [CVPR'24] HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(… · ☆325 · Updated 3 months ago
- [EMNLP 2023] InfoSeek: A New VQA Benchmark focused on Visual Info-Seeking Questions · ☆25 · Updated last year
- ☆30 · Updated last year
- MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning · ☆138 · Updated 4 months ago
- ☆360 · Updated 2 years ago