Baiqi-Li / NaturalBench
[NeurIPS 2024] Make vision matter in Visual-Question-Answering (VQA)! NaturalBench is a vision-centric VQA benchmark that challenges vision-language models with simple questions about natural imagery.
☆84 · Updated last month
Alternatives and similar repositories for NaturalBench
Users interested in NaturalBench are comparing it to the repositories listed below.
- (ECCV 2024) Empowering Multimodal Large Language Model as a Powerful Data Generator · ☆109 · Updated 2 months ago
- [NAACL 2025 Oral] From Redundancy to Relevance: Enhancing Explainability in Multimodal Large Language Models · ☆95 · Updated 3 months ago
- [ECCV 2024] Efficient Inference of Vision Instruction-Following Models with Elastic Cache · ☆43 · Updated 10 months ago
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models · ☆93 · Updated last year
- Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models · ☆175 · Updated 7 months ago
- [ICLR 2025] Mathematical Visual Instruction Tuning for Multi-modal Large Language Models · ☆142 · Updated 5 months ago
- (NeurIPS 2024) Official PyTorch implementation of LOVA3 · ☆85 · Updated 2 months ago
- [ICML 2025] Official repository for the paper "Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation" · ☆145 · Updated 2 weeks ago
- ☆66 · Updated 2 months ago
- [CVPR 2023] Official implementation of the paper: Fine-grained Audible Video Description · ☆73 · Updated last year
- [ECCV 2024] Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? · ☆163 · Updated last month
- Repository for the paper "Leopard: A Vision Language Model for Text-Rich Multi-Image Tasks" · ☆155 · Updated 5 months ago
- A post-training method to enhance CLIP's fine-grained visual representations with generative models · ☆50 · Updated 2 months ago
- [ICLR 2024 Oral] Multi-granularity Correspondence Learning from Long-term Noisy Videos · ☆113 · Updated last year
- u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model · ☆132 · Updated last month
- RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response · ☆41 · Updated 5 months ago
- WorldGPT: Empowering LLM as Multimodal World Model · ☆116 · Updated 9 months ago
- [ICLR'24] Democratizing Fine-grained Visual Recognition with Large Language Models · ☆176 · Updated 10 months ago
- [MM'24 Oral] Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval · ☆126 · Updated 9 months ago
- [AAAI 2025] Code for the paper: Enhancing Multimodal Large Language Models Complex Reasoning via Similarity Computation · ☆3 · Updated 4 months ago
- The FACTUAL benchmark dataset and the pre-trained textual scene graph parser trained on FACTUAL · ☆111 · Updated this week
- CoS: Chain-of-Shot Prompting for Long Video Understanding · ☆48 · Updated 3 months ago
- ☆29 · Updated 6 months ago
- [ICML 2025 Spotlight] Official implementation of VideoRoPE: What Makes for Good Video Rotary Position Embedding? · ☆146 · Updated last month
- Official code of the paper "Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models" · ☆74 · Updated last week
- R1-like Computer-use Agent · ☆73 · Updated 2 months ago
- ✨ Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy · ☆285 · Updated 3 weeks ago
- Official implementation of X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models · ☆154 · Updated 6 months ago
- GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? · ☆187 · Updated last year
- [EMNLP 2022] Evaluating Vision & Language Pretraining Models with Objects, Attributes and Relations · ☆130 · Updated 8 months ago