zzxslp/MM-Navigator

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/zzxslp/MM-Navigator)

zzxslp / MM-Navigator

GPT-4V in Wonderland: LMMs as Smartphone Agents

☆135

Alternatives and similar repositories for MM-Navigator

Users that are interested in MM-Navigator are comparing it to the libraries listed below

Sorting:

shulin16 / MMInA
View on GitHub
[ACL2025 Findings] Benchmarking Multihop Multimodal Internet Agents
☆48Feb 27, 2025Updated last year
shoggoth13 / agents-deconstructed
View on GitHub
☆56Sep 9, 2023Updated 2 years ago
ddupont808 / GPT-4V-Act
View on GitHub
AI agent using GPT-4V(ision) capable of using a mouse/keyboard to interact with web UI
☆1,065Dec 9, 2024Updated last year
google-research-datasets / rico_semantics
View on GitHub
Consists of ~500k human annotations on the RICO dataset identifying various icons based on their shapes and semantics, and associations b…
☆34Jun 27, 2024Updated last year
OpenGVLab / GUI-Odyssey
View on GitHub
[ICCV 2025] GUIOdyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUIOdyssey consists of 8,834 e…
☆147Jan 3, 2026Updated 2 months ago
snap-research / VIMI
View on GitHub
☆13Jul 10, 2024Updated last year
microsoft / SoM
View on GitHub
[arXiv 2023] Set-of-Mark Prompting for GPT-4V and LMMs
☆1,517Aug 19, 2024Updated last year
zzxslp / SoM-LLaVA
View on GitHub
[COLM-2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
☆145Aug 23, 2024Updated last year
chuyg1005 / seeclick-crawler
View on GitHub
☆20Apr 24, 2024Updated last year
ltzheng / Synapse
View on GitHub
[ICLR 2024] Trajectory-as-Exemplar Prompting with Memory for Computer Control
☆68Jan 7, 2026Updated 2 months ago
RUCBM / GUICourse
View on GitHub
GUICourse: From General Vision Langauge Models to Versatile GUI Agents
☆136Updated this week
VisualWebBench / VisualWebBench
View on GitHub
Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?"
☆64Oct 19, 2024Updated last year
njucckevin / SeeClick
View on GitHub
The model, data and code for the visual GUI Agent SeeClick
☆469Jul 13, 2025Updated 7 months ago
DCDmllm / HyperLLaVA
View on GitHub
Pytorch implementation of HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
☆28Mar 22, 2024Updated last year
OSU-NLP-Group / Mind2Web
View on GitHub
[NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web" -- the first LLM-based web agent and benchmark for generalist w…
☆952Nov 5, 2025Updated 4 months ago
alipay / mobile-agent
View on GitHub
☆44Mar 19, 2024Updated last year
YuxiangChai / A3
View on GitHub
☆35Jan 12, 2026Updated last month
umd-huang-lab / Mementos
View on GitHub
☆32Feb 8, 2024Updated 2 years ago
Dongping-Chen / GUI-World
View on GitHub
(ICLR 2025) The Official Code Repository for GUI-World.
☆68Dec 18, 2024Updated last year
DigiRL-agent / digirl
View on GitHub
Official repo for paper DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning.
☆389Feb 22, 2025Updated last year
aburns4 / MoTIF
View on GitHub
Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task Feasibility in Interactive Visual Environments
☆61Aug 19, 2024Updated last year
OSU-NLP-Group / SeeAct
View on GitHub
[ICML'24] SeeAct is a system for generalist web agents that autonomously carry out tasks on any given website, with a focus on large mult…
☆830Feb 3, 2025Updated last year
taogoddd / GPT-4V-API
View on GitHub
Self-hosted GPT-4V api
☆27Nov 6, 2023Updated 2 years ago
pkunlp-icler / PCA-EVAL
View on GitHub
[ACL 2024] PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain
☆106Mar 14, 2024Updated last year
MobileLLM / AutoDroid
View on GitHub
Source code for the paper "Empowering LLM to use Smartphone for Intelligent Task Automation"
☆453Mar 22, 2024Updated last year
ggjy / vision_weak_to_strong
View on GitHub
☆38Feb 8, 2024Updated 2 years ago
xbmxb / CoCo-Agent
View on GitHub
☆35Jun 20, 2024Updated last year
aszala / DiagrammerGPT
View on GitHub
Official code repository for: DiagrammerGPT: Generating Open-Domain, Open-Platform Diagrams via LLM Planning (COLM 2024)
☆154Sep 12, 2024Updated last year
THUDM / VisualAgentBench
View on GitHub
Towards Large Multimodal Models as Visual Foundation Agents
☆256Apr 24, 2025Updated 10 months ago
wooseongY / MOANA-File-Player
View on GitHub
ROS-based File Player for MOANA Dataset
☆25Feb 25, 2025Updated last year
aszala / VPEval
View on GitHub
VPEval Codebase from Visual Programming for Text-to-Image Generation and Evaluation (NeurIPS 2023)
☆45Nov 29, 2023Updated 2 years ago
WeihuangLin / INF-LLaVA
View on GitHub
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model
☆42Aug 4, 2024Updated last year
akshgarg7 / Specter
View on GitHub
Designed to help lawyers and legal professionals find precedent fast and prepare for case negotiations by simulating trajectories
☆10Oct 16, 2024Updated last year
RiTUAL-MBZUAI / Font-prediction-dataset
View on GitHub
This is a data repository for the ACL 2020 paper: "Let Me Choose: From Verbal Context to Font Selection"
☆10May 5, 2020Updated 5 years ago
IMNearth / CoAT
View on GitHub
Official implementation for "Android in the Zoo: Chain-of-Action-Thought for GUI Agents" (Findings of EMNLP 2024)
☆100Oct 14, 2024Updated last year
zyang-ur / idea2img
View on GitHub
Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation, ECCV 2024
☆22Feb 15, 2024Updated 2 years ago
zai-org / CogVLM
View on GitHub
a state-of-the-art-level open visual language model | 多模态预训练模型
☆6,724May 29, 2024Updated last year
web-arena-x / visualwebarena
View on GitHub
VisualWebArena is a benchmark for multimodal agents.
☆440Nov 9, 2024Updated last year
showlab / videogui
View on GitHub
[NeurIPS 2024 D&B] VideoGUI: A Benchmark for GUI Automation from Instructional Videos
☆51Feb 22, 2026Updated last week