google-research-datasets / screen_qa
The ScreenQA dataset was introduced in the paper "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots". It contains ~86K question-answer pairs collected by human annotators for ~35K screenshots from Rico, and is intended for training and evaluating models for screen content understanding via question answering.
☆90 · Updated 3 months ago
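The QA annotations are distributed separately from the Rico screenshots, so using the dataset typically means joining the two by screen identifier. Below is a minimal loading sketch in Python; the file paths and field names (`screen_id`, `questions`, `answers`) are illustrative assumptions, not the repository's actual schema, so check the released data files for the real layout.

```python
import json
from pathlib import Path

# Hypothetical paths -- adjust to wherever the annotations and Rico images live locally.
SCREENQA_JSON = Path("screen_qa/answers.json")  # assumed annotation file name
RICO_DIR = Path("rico/combined")                # Rico screenshots, obtained separately

def load_screenqa(json_path: Path) -> list[dict]:
    """Join QA annotations with the matching Rico screenshot paths."""
    with json_path.open() as f:
        records = json.load(f)
    examples = []
    for rec in records:
        screenshot = RICO_DIR / f"{rec['screen_id']}.jpg"  # 'screen_id' and extension are assumed
        for qa in rec.get("questions", []):                 # 'questions' is an assumed key
            examples.append({
                "image": screenshot,
                "question": qa["question"],
                "answers": qa.get("answers", []),
            })
    return examples

if __name__ == "__main__":
    data = load_screenqa(SCREENQA_JSON)
    print(f"Loaded {len(data)} question-answer pairs")
```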
Related projects
Alternatives and complementary repositories for screen_qa
- E5-V: Universal Embeddings with Multimodal Large Language Models ☆167 · Updated 3 months ago
- The Screen Annotation dataset consists of pairs of mobile screenshots and their annotations. The annotations are in text format, and desc… ☆48 · Updated 8 months ago
- GUICourse: From General Vision Language Models to Versatile GUI Agents ☆78 · Updated 3 months ago
- The model, data and code for the visual GUI Agent SeeClick ☆215 · Updated 2 months ago
- [NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs ☆75 · Updated 2 weeks ago
- Official implementation for "You Only Look at Screens: Multimodal Chain-of-Action Agents" (Findings of ACL 2024) ☆196 · Updated 3 months ago
- A family of highly capable yet efficient large multimodal models ☆161 · Updated 2 months ago
- Official Repo for UGround ☆93 · Updated this week
- InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions (AAAI 2024) ☆137 · Updated 5 months ago
- Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding" ☆258 · Updated 4 months ago
- Towards Large Multimodal Models as Visual Foundation Agents ☆113 · Updated last week
- Environments, tools, and benchmarks for general computer agents ☆171 · Updated 2 weeks ago
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities (ICML 2024) ☆264 · Updated this week
- This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for E… ☆353 · Updated 3 weeks ago
- Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M d… ☆186 · Updated 2 months ago
- VisualWebArena is a benchmark for multimodal agents. ☆235 · Updated last month
- OS-ATLAS: A Foundation Action Model For Generalist GUI Agents ☆118 · Updated this week
- HPT - Open Multimodal LLMs from HyperGAI ☆312 · Updated 5 months ago
- The huggingface implementation of Fine-grained Late-interaction Multi-modal Retriever. ☆68 · Updated 2 months ago
- LLaVA-HR: High-Resolution Large Language-Vision Assistant ☆213 · Updated 2 months ago
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs ☆62 · Updated 2 weeks ago
- Official code for the paper "UniIR: Training and Benchmarking Universal Multimodal Information Retrievers" (ECCV 2024) ☆105 · Updated last month
- MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning ☆133 · Updated last year
- [NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of… ☆98 · Updated 2 weeks ago
- GPT-4V in Wonderland: LMMs as Smartphone Agents ☆128 · Updated 3 months ago