kyegomez / ScreenAI
Implementation of the ScreenAI model from the paper: "A Vision-Language Model for UI and Infographics Understanding"
☆340 · Updated last month
Alternatives and similar repositories for ScreenAI
Users who are interested in ScreenAI are comparing it to the repositories listed below.
- The model, data and code for the visual GUI Agent SeeClick ☆368 · Updated 5 months ago
- [ICML2025] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction ☆289 · Updated 2 months ago
- OS-ATLAS: A Foundation Action Model For Generalist GUI Agents ☆332 · Updated 3 weeks ago
- The Screen Annotation dataset consists of pairs of mobile screenshots and their annotations. The annotations are in text format, and desc… ☆70 · Updated last year
- 💻 A curated list of papers and resources for multi-modal Graphical User Interface (GUI) agents. ☆668 · Updated 2 months ago
- ScreenQA dataset was introduced in the "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots" paper. It contains ~86K … ☆115 · Updated 3 months ago
- GUI Grounding for Professional High-Resolution Computer Use ☆197 · Updated 2 months ago
- [ICML'24] SeeAct is a system for generalist web agents that autonomously carry out tasks on any given website, with a focus on large mult… ☆742 · Updated 3 months ago
- [ICLR 2025] A trinity of environments, tools, and benchmarks for general virtual agents ☆201 · Updated 3 weeks ago
- ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model (IJCAI-24) ☆451 · Updated 5 months ago
- Official implementation for "You Only Look at Screens: Multimodal Chain-of-Action Agents" (Findings of ACL 2024) ☆234 · Updated 9 months ago
- [CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use. ☆1,238 · Updated 2 months ago
- [ICLR'25 Oral] UGround: Universal GUI Visual Grounding for GUI Agents ☆219 · Updated last week
- AndroidWorld is an environment and benchmark for autonomous agents ☆308 · Updated last week
- ☆116 · Updated last year
- This is a collection of resources for computer-use GUI agents, including videos, blogs, papers, and projects. ☆358 · Updated 3 weeks ago
- [NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web" ☆820 · Updated last month
- WebLINX is a benchmark for building web navigation agents with conversational capabilities ☆146 · Updated 3 months ago
- Official repo for paper DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning. ☆353 · Updated 2 months ago
- Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of multi-modal AI agents. ☆691 · Updated last week
- GPT-4V in Wonderland: LMMs as Smartphone Agents ☆134 · Updated 9 months ago
- VisualWebArena is a benchmark for multimodal agents. ☆336 · Updated 6 months ago
- Building Open LLM Web Agents with Self-Evolving Online Curriculum RL ☆373 · Updated 2 weeks ago
- Open-sourced, Fast and Context-aware Action Grounding from GUI Instructions for GUI/Computer-use Agents ☆357 · Updated 3 months ago
- HPT - Open Multimodal LLMs from HyperGAI ☆315 · Updated 11 months ago
- Code for "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models" ☆769 · Updated last year
- ☆297 · Updated last year
- GUICourse: From General Vision Language Models to Versatile GUI Agents ☆114 · Updated 9 months ago
- Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents" ☆985 · Updated 3 months ago
- Code and implementations for the paper "AgentGym: Evolving Large Language Model-based Agents across Diverse Environments" by Zhiheng Xi e… ☆462 · Updated 2 months ago