zzxslp / MM-Navigator
GPT-4V in Wonderland: LMMs as Smartphone Agents
☆128Updated 4 months ago
Related projects
Alternatives and complementary repositories for MM-Navigator
- Official implementation for "You Only Look at Screens: Multimodal Chain-of-Action Agents" (Findings of ACL 2024)☆198Updated 4 months ago
- Official Repo for UGround☆97Updated 2 weeks ago
- The model, data and code for the visual GUI Agent SeeClick☆226Updated 2 months ago
- GUICourse: From General Vision Language Models to Versatile GUI Agents☆83Updated 4 months ago
- Towards Large Multimodal Models as Visual Foundation Agents☆122Updated last week
- WebLINX is a benchmark for building web navigation agents with conversational capabilities☆118Updated last month
- ☆65Updated last year
- Environments, tools, and benchmarks for general computer agents☆172Updated 3 weeks ago
- VisualWebArena is a benchmark for multimodal agents.☆244Updated last week
- GUI Odyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes fr…☆69Updated last week
- OS-ATLAS: A Foundation Action Model For Generalist GUI Agents☆166Updated this week
- A Universal Platform for Training and Evaluation of Mobile Interaction☆37Updated last week
- Codes for Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models☆124Updated 3 weeks ago
- Code for Paper: Harnessing Webpage UIs for Text-Rich Visual Understanding☆38Updated last month
- The Official Code Repository for GUI-World.☆41Updated 3 months ago
- ☆35Updated last year
- ☆152Updated 4 months ago
- Official repo for paper DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning.☆263Updated last month
- ControlLLM: Augment Language Models with Tools by Searching on Graphs☆186Updated 4 months ago
- ☆116Updated 5 months ago
- ☆51Updated 10 months ago
- ScreenQA dataset was introduced in the "ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots" paper. It contains ~86K …☆91Updated 4 months ago
- This is a collection of resources for computer-use agents, including videos, blogs, papers, and projects.☆102Updated 2 weeks ago
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs☆62Updated 3 weeks ago
- HPT - Open Multimodal LLMs from HyperGAI☆312Updated 5 months ago
- Long Context Transfer from Language to Vision☆334Updated 3 weeks ago
- Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?"☆47Updated last month
- ☆31Updated 8 months ago
- Democratization of "PaLI: A Jointly-Scaled Multilingual Language-Image Model"☆86Updated 8 months ago
- 💻 A curated list of papers and resources for multi-modal Graphical User Interface (GUI) agents.☆200Updated this week