Baiqi-Li / NaturalBench
[NeurIPS24] Make Vision Matter in Visual-Question-Answering (VQA)! Introducing NaturalBench, a vision-centric VQA benchmark (NeurIPS'24) that challenges vision-language models with simple questions about natural imagery.
☆84 · Updated last month
Alternatives and similar repositories for NaturalBench
Users who are interested in NaturalBench are comparing it to the repositories listed below.
- (ECCV 2024) Empowering Multimodal Large Language Model as a Powerful Data Generator ☆112 · Updated 4 months ago
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-language Models ☆97 · Updated last year
- [NAACL 2025 Oral] From redundancy to relevance: Enhancing explainability in multimodal large language models ☆107 · Updated 5 months ago
- [ECCV 2024] Efficient Inference of Vision Instruction-Following Models with Elastic Cache ☆42 · Updated last year
- Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models ☆177 · Updated 9 months ago
- u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model ☆134 · Updated 3 months ago
- [CVPR 2023] Official implementation of the paper: Fine-grained Audible Video Description ☆73 · Updated last year
- [ICML 2025] Official repository for paper "Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation" ☆165 · Updated 2 months ago
- [ICLR 2025] Mathematical Visual Instruction Tuning for Multi-modal Large Language Models ☆146 · Updated 8 months ago
- ☆67 · Updated 4 months ago
- Multi-granularity Correspondence Learning from Long-term Noisy Videos [ICLR 2024, Oral] ☆116 · Updated last year
- [ICCV 2025] SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs ☆69 · Updated last month
- [ICLR'24] Democratizing Fine-grained Visual Recognition with Large Language Models ☆180 · Updated last year
- [ACL 2023 Findings] FACTUAL dataset, the textual scene graph parser trained on FACTUAL. ☆113 · Updated last month
- [ICCV 2025] Boosting MLLM Reasoning with Text-Debiased Hint-GRPO ☆31 · Updated last month
- GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? ☆186 · Updated last year
- WorldGPT: Empowering LLM as Multimodal World Model ☆117 · Updated 11 months ago
- [ECCV 2024] Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? ☆167 · Updated 3 months ago
- (NeurIPS 2024) Official PyTorch implementation of LOVA3 ☆89 · Updated 4 months ago
- Your efficient and accurate answer verification system for RL training. ☆34 · Updated last month
- ✨✨ Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy ☆292 · Updated 2 months ago
- The repository for the paper titled "Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks" ☆158 · Updated 7 months ago
- ☆29 · Updated 8 months ago
- (AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions ☆261 · Updated last year
- A Gaussian dense reward framework for GUI grounding training ☆155 · Updated 2 weeks ago
- (ICCV 2025) Enhance CLIP and MLLM's fine-grained visual representations with generative models. ☆68 · Updated last month
- Evaluating Vision & Language Pretraining Models with Objects, Attributes and Relations. [EMNLP 2022] ☆131 · Updated 10 months ago
- [MM'24 Oral] Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval ☆127 · Updated 11 months ago
- [AAAI 2025] Code for paper: Enhancing Multimodal Large Language Models Complex Reasoning via Similarity Computation ☆4 · Updated 6 months ago
- An open-source implementation for training LLaVA-NeXT. ☆412 · Updated 9 months ago