google-research-datasets / screen_annotationLinks

The Screen Annotation dataset consists of pairs of mobile screenshots and their annotations. The annotations are in text format, and describe the UI elements present on the screen: their type, location, OCR text and a short description. It has been introduced in the paper `ScreenAI: A Vision-Language Model for UI and Infographics Understanding`.

☆77

Alternatives and similar repositories for screen_annotation

Users that are interested in screen_annotation are comparing it to the libraries listed below

Sorting:

google-research-datasets / screen_qa
ScreenQA dataset was introduced in the "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots" paper. It contains ~86K …
☆128Updated 8 months ago
RUCBM / GUICourse
GUICourse: From General Vision Langauge Models to Versatile GUI Agents
☆130Updated last year
OSU-NLP-Group / UGround
[ICLR'25 Oral] UGround: Universal GUI Visual Grounding for GUI Agents
☆278Updated 2 months ago
likaixin2000 / ScreenSpot-Pro-GUI-Grounding
GUI Grounding for Professional High-Resolution Computer Use
☆263Updated 3 weeks ago
cooelf / Auto-GUI
Official implementation for "You Only Look at Screens: Multimodal Chain-of-Action Agents" (Findings of ACL 2024)
☆251Updated last year
zzxslp / MM-Navigator
GPT-4V in Wonderland: LMMs as Smartphone Agents
☆135Updated last year
xlang-ai / aguvis
[ICML2025] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
☆362Updated 7 months ago
njucckevin / SeeClick
The model, data and code for the visual GUI Agent SeeClick
☆431Updated 2 months ago
MBZUAI-LLM / web2code
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
☆90Updated 11 months ago
OpenGVLab / GUI-Odyssey
[ICCV 2025] GUIOdyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUIOdyssey consists of 8,834 e…
☆130Updated 2 months ago
IMNearth / CoAT
Official implementation for "Android in the Zoo: Chain-of-Action-Thought for GUI Agents" (Findings of EMNLP 2024)
☆94Updated 11 months ago
OS-Copilot / OS-Atlas
OS-ATLAS: A Foundation Action Model For Generalist GUI Agents
☆385Updated 5 months ago
THUDM / VisualAgentBench
Towards Large Multimodal Models as Visual Foundation Agents
☆238Updated 5 months ago
ltzheng / agent-studio
[ICLR 2025] A trinity of environments, tools, and benchmarks for general virtual agents
☆216Updated 3 months ago
Dongping-Chen / GUI-World
(ICLR 2025) The Official Code Repository for GUI-World.
☆65Updated 9 months ago
McGill-NLP / weblinx
WebLINX is a benchmark for building web navigation agents with conversational capabilities
☆155Updated 8 months ago
neulab / MultiUI
Code for Paper: Harnessing Webpage Uis For Text Rich Visual Understanding
☆52Updated 9 months ago
X-LANCE / Mobile-Env
A Universal Platform for Training and Evaluation of Mobile Interaction
☆55Updated 2 weeks ago
xlang-ai / OSWorld-G
[NeurIPS 2025 Spotlight] Scaling Computer-Use Grounding via UI Decomposition and Synthesis
☆113Updated 3 months ago
THUDM / Android-Lab
☆239Updated last month
web-arena-x / visualwebarena
VisualWebArena is a benchmark for multimodal agents.
☆388Updated 11 months ago
chuyg1005 / seeclick-crawler
☆20Updated last year
OS-Copilot / OS-Genesis
[ACL 2025] Code and data for OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
☆163Updated last month
HyperGAI / HPT
HPT - Open Multimodal LLMs from HyperGAI
☆315Updated last year
Yan98 / GTA1
☆101Updated last week
OpenGVLab / ZeroGUI
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
☆92Updated 2 months ago
YuxiangChai / AMEX-codebase
☆31Updated last year
hewei2001 / ReachQA
[EMNLP 2025] Distill Visual Chart Reasoning Ability from LLMs to MLLMs
☆55Updated last month
OpenGVLab / ControlLLM
ControlLLM: Augment Language Models with Tools by Searching on Graphs
☆193Updated last year
MILVLG / imp
a family of highly capabale yet efficient large multimodal models
☆191Updated last year