Challenges for general-purpose web-browsing AI agents
☆67 · Jun 2, 2025 · Updated 9 months ago
Alternatives and similar repositories for webgames
Users who are interested in webgames are comparing it to the repositories listed below.
- Official Repo for InSTA: Towards Internet-Scale Training For Agents ☆56 · Jul 11, 2025 · Updated 7 months ago
- [ACL'25 (Findings)] Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents ☆26 · Feb 17, 2026 · Updated 2 weeks ago
- Setup scripts for the WebArena benchmark ☆19 · Jun 19, 2025 · Updated 8 months ago
- Code and dataset for NAACL 2022 paper "CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination" Hyounghun Kim, Abhay Zala, Mohi… ☆16 · Nov 26, 2022 · Updated 3 years ago
- Benchmark of complex, multimodal desktop-oriented tasks for advanced GUI-navigation AI agents ☆24 · May 7, 2025 · Updated 9 months ago
- [ACL2025 Findings] Benchmarking Multihop Multimodal Internet Agents ☆48 · Feb 27, 2025 · Updated last year
- Building Open LLM Web Agents with Self-Evolving Online Curriculum RL ☆512 · Jun 6, 2025 · Updated 9 months ago
- ☆32 · Aug 17, 2025 · Updated 6 months ago
- WebLINX is a benchmark for building web navigation agents with conversational capabilities ☆160 · Feb 11, 2025 · Updated last year
- Benchmark and research code for the paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks" ☆261 · May 5, 2025 · Updated 10 months ago
- ☆67 · Mar 6, 2025 · Updated last year
- [NeurIPS'25 D&B] Mind2Web-2 Benchmark: Evaluating Agentic Search with Agent-as-a-Judge ☆102 · Updated this week
- 🌎💪 BrowserGym, a Gym environment for web task automation ☆1,140 · Feb 10, 2026 · Updated 3 weeks ago
- Research Code for "ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL" ☆202 · Apr 17, 2025 · Updated 10 months ago
- VisualWebArena is a benchmark for multimodal agents. ☆440 · Nov 9, 2024 · Updated last year
- WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks? ☆234 · Feb 23, 2026 · Updated last week
- This is the code repo for our paper "Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts". ☆45 · Sep 27, 2025 · Updated 5 months ago
- ☆231 · Feb 24, 2025 · Updated last year
- MiniGPT-Pancreas: Multimodal Large language Model for Pancreas Cancer Classification and Detection ☆11 · Sep 19, 2025 · Updated 5 months ago
- Public teaching materials for Reasoning and Agents ☆12 · May 29, 2025 · Updated 9 months ago
- Some microbenchmarks and design docs before commencement ☆12 · Feb 1, 2021 · Updated 5 years ago
- ☆12 · Jul 6, 2022 · Updated 3 years ago
- ☆17 · Sep 3, 2025 · Updated 6 months ago
- CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics ☆27 · Nov 1, 2025 · Updated 4 months ago
- A Samsung Tizen TV driver for Appium ☆12 · Updated this week
- Code for NeurIPS 2022 Datasets and Benchmarks paper - EgoTaskQA: Understanding Human Tasks in Egocentric Videos. ☆37 · Apr 17, 2023 · Updated 2 years ago
- Human nature detector ☆37 · Nov 23, 2018 · Updated 7 years ago
- [NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments ☆2,608 · Updated this week
- MobileManager is an application used for automation testing of iOS and Android mobile devices. ☆10 · Jan 6, 2023 · Updated 3 years ago
- AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence ☆10 · Mar 2, 2025 · Updated last year
- A Node.js library that enables communication with iOS devices using remote XPC services. It supports device lockdown, property-list (plis… ☆19 · Feb 20, 2026 · Updated 2 weeks ago
- Code repository supporting the paper "Auto-Generating Weak Labels for Real & Synthetic Data to Improve Label-Scarce Medical Image Segment… ☆11 · Apr 29, 2024 · Updated last year
- A benchmark dataset designed to support the development and evaluation of large language models (LLMs) for conversational mental health a… ☆17 · Feb 24, 2025 · Updated last year
- Reinforcement learning Snake game ☆14 · Oct 19, 2023 · Updated 2 years ago
- ☆38 · Jan 19, 2026 · Updated last month
- ☆13 · Jun 26, 2025 · Updated 8 months ago
- ☆31 · Feb 26, 2026 · Updated last week
- Official repo of paper LM2 ☆47 · Feb 13, 2025 · Updated last year
- Planning with Deep Neural Networks: A Survey ☆44 · Apr 10, 2023 · Updated 2 years ago