[NeurIPS'25 D&B] Mind2Web-2 Benchmark: Evaluating Agentic Search with Agent-as-a-Judge
☆111May 17, 2026Updated this week
Alternatives and similar repositories for Mind2Web-2
Users that are interested in Mind2Web-2 are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.
Sorting:
- An Illusion of Progress? Assessing the Current State of Web Agents☆176Jan 2, 2026Updated 4 months ago
- ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry☆49Jan 5, 2026Updated 4 months ago
- [ICLR'25] "Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers"☆43Mar 31, 2025Updated last year
- MUFFIN: Curating Multi-Faceted Instructions for Improving Instruction-Following☆16Oct 31, 2024Updated last year
- [ICLR'25] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery☆137Apr 29, 2026Updated 3 weeks ago
- Managed Database hosting by DigitalOcean • AdPostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
- Awesome GUI Agent Paper List☆785May 12, 2026Updated last week
- [ACL'24] Code and data of paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator"☆54Feb 23, 2024Updated 2 years ago
- ☆28Mar 10, 2026Updated 2 months ago
- ☆25Apr 3, 2025Updated last year
- ☆165May 16, 2026Updated last week
- True Few-Shot BioIE: Benchmarking GPT-3 In-Context and Small PLM Fine-Tuning☆12Jul 6, 2022Updated 3 years ago
- Evaluating GPT-OSS on BrowseComp-Plus with Native Browsering Tools☆20Oct 17, 2025Updated 7 months ago
- [ACL2025 Findings] Benchmarking Multihop Multimodal Internet Agents☆54Feb 27, 2025Updated last year
- [TMLR'25] "Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents"☆101Oct 5, 2025Updated 7 months ago
- GPUs on demand by Runpod - Special Offer Available • AdRun AI, ML, and HPC workloads on powerful cloud GPUs—without limits or wasted spend. Deploy GPUs in under a minute and pay by the second.
- An Autonomous Curriculum Reinforcement Learning framework that steers agents to continually learn in specific environments with zero huma…☆32May 13, 2026Updated last week
- Code and data for paper "Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation".☆24Oct 22, 2025Updated 7 months ago
- [ICLR'25 Oral] UGround: Universal GUI Visual Grounding for GUI Agents☆315Mar 11, 2026Updated 2 months ago
- Bridging the Generalization Gap in Text-to-SQL Parsing with Schema Expansion☆13Jul 26, 2023Updated 2 years ago
- ☆18Mar 3, 2025Updated last year
- The source code for running LLMs on the AAAR-1.0 benchmark.☆18Apr 5, 2025Updated last year
- ☆45Dec 16, 2025Updated 5 months ago
- Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments (EMNLP'2024)☆37Dec 29, 2024Updated last year
- ☆13Feb 6, 2025Updated last year
- Managed Database hosting by DigitalOcean • AdPostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
- [ICML'24] SeeAct is a system for generalist web agents that autonomously carry out tasks on any given website, with a focus on large mult…☆846Feb 3, 2025Updated last year
- ☆186Oct 31, 2025Updated 6 months ago
- Code for the paper 🌳 Tree Search for Language Model Agents☆222Jul 25, 2024Updated last year
- [ICLR2025 Spotlight] Agent Trajectory Synthesis via Guiding Replay with Web Tutorials☆54Feb 21, 2025Updated last year
- ☆55Apr 13, 2026Updated last month
- [COLM'24] How Easily do Irrelevant Inputs Skew the Responses of Large Language Models?☆22Oct 13, 2024Updated last year
- [NeurIPS 2025 𝐒𝐩𝐨𝐭𝐥𝐢𝐠𝐡𝐭] AutoToM: Scaling Model-based Mental Inference via Automated Agent Modeling☆44Mar 28, 2026Updated last month
- [COLM'24] "Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning"☆21Jun 14, 2024Updated last year
- ☆23Aug 14, 2023Updated 2 years ago
- Deploy to Railway using AI coding agents - Free Credits Offer • AdUse Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
- Reproducible and flexible LLM evaluations for scientific reasoning.☆28Jul 23, 2025Updated 10 months ago
- Official repository of the video reasoning benchmark MMR-V. Can Your MLLMs "Think with Video"? [ICLR26]☆39Jun 23, 2025Updated 11 months ago
- [ICML 2026 Spotlight] Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback☆66May 4, 2026Updated 2 weeks ago
- Lean formalizations of IMO problem statements☆35Apr 23, 2026Updated last month
- ☆13Nov 11, 2022Updated 3 years ago
- The official repo of "WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents"☆117Sep 29, 2025Updated 7 months ago
- APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation. A system-level optimization for scalable LLM tra…☆57Oct 11, 2025Updated 7 months ago