AI agent using GPT-4V(ision) capable of using a mouse/keyboard to interact with web UI
★ 1,059 · Dec 9, 2024 · Updated last year
Alternatives and similar repositories for GPT-4V-Act
Users that are interested in GPT-4V-Act are comparing it to the libraries listed below.
- [arXiv 2023] Set-of-Mark Prompting for GPT-4V and LMMs ★ 1,544 · Aug 19, 2024 · Updated last year
- GPT-4 Vision x Vimium = Autonomous Web Agent ★ 166 · Nov 16, 2023 · Updated 2 years ago
- [ICML'24] SeeAct is a system for generalist web agents that autonomously carry out tasks on any given website, with a focus on large mult… ★ 849 · Feb 3, 2025 · Updated last year
- GPT-4V in Wonderland: LMMs as Smartphone Agents ★ 134 · Jul 17, 2024 · Updated last year
- [NeurIPS'23 Spotlight] "Mind2Web: Towards a Generalist Agent for the Web" -- the first LLM-based web agent and benchmark for generalist w… ★ 1,003 · Nov 5, 2025 · Updated 7 months ago
- Code for "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models" ★ 1,096 · Mar 4, 2024 · Updated 2 years ago
- An AI agent for interacting with a computer using the graphical user interface ★ 81 · Oct 12, 2023 · Updated 2 years ago
- Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents" ★ 1,524 · Nov 26, 2025 · Updated 7 months ago
- Browse the web with GPT-4V and Vimium ★ 2,652 · Sep 25, 2024 · Updated last year
- VisualWebArena is a benchmark for multimodal agents. ★ 481 · Nov 9, 2024 · Updated last year
- A state-of-the-art-level open visual language model | multimodal pre-trained model ★ 6,742 · May 29, 2024 · Updated 2 years ago
- Open Source Generative Process Automation (i.e. Generative RPA). AI-First Process Automation with Large ([Language (LLMs) / Action (LAMs)… ★ 1,624 · Jun 13, 2026 · Updated 2 weeks ago
- AppAgent: Multimodal Agents as Smartphone Users, an LLM-based multimodal agent framework designed to operate smartphone apps. ★ 6,788 · Mar 19, 2025 · Updated last year
- Command your browser with GPT ★ 420 · Feb 3, 2026 · Updated 4 months ago
- Vision utilities for web interaction agents ★ 1,764 · Nov 25, 2024 · Updated last year
- [COLM 2024] OpenAgents: An Open Platform for Language Agents in the Wild ★ 4,845 · Nov 18, 2024 · Updated last year
- An LLM-based Web Navigating Agent (KDD'24) ★ 929 · Sep 27, 2024 · Updated last year
- A self-improving embodied conversational agent seamlessly integrated into the operating system to automate our daily tasks. ★ 1,776 · Sep 9, 2024 · Updated last year
- WebLINX is a benchmark for building web navigation agents with conversational capabilities ★ 161 · Feb 11, 2025 · Updated last year
- The model, data and code for the visual GUI Agent SeeClick ★ 486 · Jul 13, 2025 · Updated 11 months ago
- A framework to enable a multimodal model to operate a computer. ★ 10,247 · Sep 19, 2025 · Updated 9 months ago
- Large Action Model framework to develop AI Web Agents ★ 6,375 · Jan 21, 2025 · Updated last year
- [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond. ★ 24,881 · Aug 12, 2024 · Updated last year
- An Open-source Framework for Data-centric, Self-evolving Autonomous Language Agents ★ 5,935 · Sep 26, 2024 · Updated last year
- Mobile-Agent: The Powerful GUI Agent Family ★ 8,864 · May 14, 2026 · Updated last month
- Unofficial implementation and experiments related to Set-of-Mark (SoM) ★ 87 · Oct 20, 2023 · Updated 2 years ago
- Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls) ★ 12,921 · Apr 13, 2026 · Updated 2 months ago
- [NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments ★ 2,970 · Updated this week
- A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24) ★ 3,519 · Feb 8, 2026 · Updated 4 months ago
- LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills ★ 767 · Feb 1, 2024 · Updated 2 years ago
- [ACL 2024] PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain ★ 107 · Mar 14, 2024 · Updated 2 years ago
- [COLM-2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs ★ 145 · Aug 23, 2024 · Updated last year
- ★ 35 · Mar 24, 2023 · Updated 3 years ago
- [ICLR'24 spotlight] An open platform for training, serving, and evaluating large language models for tool learning. ★ 5,675 · May 21, 2025 · Updated last year
- A programming framework for agentic AI ★ 59,261 · Apr 15, 2026 · Updated 2 months ago
- A lightweight coding agent for open models like Deepseek, Kimi, and Qwen ★ 64,111 · Jun 20, 2026 · Updated last week
- AgentTuning: Enabling Generalized Agent Abilities for LLMs ★ 1,499 · Oct 31, 2023 · Updated 2 years ago
- [ICLR 2024] Trajectory-as-Exemplar Prompting with Memory for Computer Control ★ 69 · Jan 7, 2026 · Updated 5 months ago
- An AutoGPT agent that controls Chrome on your desktop ★ 1,742 · Oct 25, 2023 · Updated 2 years ago