gasse / webarena-setup
Setup scripts for the WebArena benchmark
☆11 Updated last week
Alternatives and similar repositories for webarena-setup
Users who are interested in webarena-setup are comparing it to the repositories listed below.
- "Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents" ☆77 Updated 2 months ago
- [ICML 2025] Teaching Language Models to Critique via Reinforcement Learning ☆99 Updated last month
- [NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI ☆102 Updated 3 months ago
- Code for "RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing" ☆22 Updated 3 months ago
- [ACL'24] Code and data of paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator" ☆54 Updated last year
- Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha… ☆125 Updated last year
- Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments (EMNLP'2024) ☆37 Updated 6 months ago
- Official repository for ACL 2025 paper "Model Extrapolation Expedites Alignment" ☆73 Updated last month
- ☆33 Updated 4 months ago
- Revisiting Mid-training in the Era of RL Scaling ☆62 Updated 2 months ago
- Implementation for the paper "The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning" ☆61 Updated 2 weeks ago
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆144 Updated 8 months ago
- [ICLR'24 Spotlight] "Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts" ☆69 Updated last year
- The rule-based evaluation subset and code implementation of Omni-MATH ☆22 Updated 6 months ago
- Official Implementation of ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay ☆79 Updated last month
- The evaluation code for MultiIF multi-turn and multi-lingual instruction following ☆45 Updated 8 months ago
- Official implementation of the paper "From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large L… ☆50 Updated last year
- ☆68 Updated 3 months ago
- Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference) ☆144 Updated 7 months ago
- The repository for ACL 2024 paper "TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models" ☆31 Updated last year
- Code and data used in the paper: "Training on Incorrect Synthetic Data via RL Scales LLM Math Reasoning Eight-Fold" ☆30 Updated last year
- Sotopia-π: Interactive Learning of Socially Intelligent Language Agents (ACL 2024) ☆65 Updated last year
- Suri: Multi-constraint instruction following for long-form text generation (EMNLP’24) ☆23 Updated 7 months ago
- [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following ☆127 Updated 11 months ago
- [ICML 2025] M-STAR (Multimodal Self-Evolving TrAining for Reasoning) Project. Diving into Self-Evolving Training for Multimodal Reasoning ☆61 Updated 6 months ago
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision ☆121 Updated 9 months ago
- Critique-out-Loud Reward Models ☆66 Updated 8 months ago
- A dataset for training and evaluating LLMs on decision making about "when (not) to call" functions ☆23 Updated 2 months ago
- [ICLR 25 Oral] RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style ☆50 Updated this week
- Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering ☆60 Updated 6 months ago