StonyBrookNLP / appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agents, ACL'24 Best Resource Paper.
☆290 · Updated this week
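For readers comparing these environments: AppWorld tasks are solved by writing code that calls the simulated apps' APIs inside a sandboxed world. The sketch below is a minimal interaction loop based on the `appworld` Python package's README; the names used here (`load_task_ids`, `AppWorld`, `world.execute`, `world.task_completed`) reflect my reading of that README and should be checked against the current docs, and the hard-coded "agent" step is only a placeholder, not the paper's method.

```python
# Minimal AppWorld interaction sketch (assumes `pip install appworld`
# and that the environment/data setup from the README has been run).
from appworld import AppWorld, load_task_ids

# Task IDs are grouped into splits such as train / dev / test_normal.
task_ids = load_task_ids("train")

with AppWorld(task_id=task_ids[0], experiment_name="minimal_sketch") as world:
    # Natural-language instruction the agent must satisfy.
    print(world.task.instruction)

    # A real agent would generate this code from the instruction; here we
    # only list the available apps via the environment's API-docs app.
    output = world.execute("print(apis.api_docs.show_app_descriptions())")
    print(output)

    # Reports whether the task's ground-truth checks pass so far.
    print("Task completed:", world.task_completed())
```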
Alternatives and similar repositories for appworld
Users who are interested in appworld are comparing it to the libraries listed below
- Benchmark and research code for the paper SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks ☆246 · Updated 5 months ago
- An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral] ☆355 · Updated last year
- Code for the paper 🌳 Tree Search for Language Model Agents ☆217 · Updated last year
- AWM: Agent Workflow Memory ☆335 · Updated 8 months ago
- [COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ☆176 · Updated 3 months ago
- [NeurIPS 2022] 🛒 WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents ☆411 · Updated last year
- ☆239 · Updated last year
- Building Open LLM Web Agents with Self-Evolving Online Curriculum RL ☆463 · Updated 4 months ago
- [NeurIPS 2024] Agent Planning with World Knowledge Model ☆151 · Updated 10 months ago
- A tool-learning benchmark based on ToolBench that aims to balance stability and realism. ☆191 · Updated 6 months ago
- A benchmark list for evaluating large language models. ☆145 · Updated last month
- Framework and toolkits for building and evaluating collaborative agents that can work together with humans. ☆103 · Updated this week
- An Illusion of Progress? Assessing the Current State of Web Agents ☆99 · Updated 3 months ago
- [ICLR 2025] Benchmarking Agentic Workflow Generation ☆129 · Updated 8 months ago
- VisualWebArena is a benchmark for multimodal agents. ☆392 · Updated 11 months ago
- ☆323 · Updated 4 months ago
- Code for Paper: Autonomous Evaluation and Refinement of Digital Agents [COLM 2024] ☆146 · Updated 10 months ago
- Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike stat… ☆321 · Updated last week
- Code and example data for the paper: Rule Based Rewards for Language Model Safety ☆201 · Updated last year
- Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference) ☆151 · Updated 11 months ago
- A Comprehensive Benchmark for Software Development. ☆115 · Updated last year
- ☆210 · Updated 6 months ago
- Towards Large Multimodal Models as Visual Foundation Agents ☆240 · Updated 6 months ago
- Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025] ☆553 · Updated 2 months ago
- ☆116 · Updated 9 months ago
- An extensible benchmark for evaluating large language models on planning ☆419 · Updated last month
- [TMLR'25] "Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents" ☆88 · Updated 2 weeks ago
- Reproducible, flexible LLM evaluations ☆257 · Updated last week
- [ACL 2024] AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning ☆229 · Updated 9 months ago
- Code for STaR: Bootstrapping Reasoning With Reasoning (NeurIPS 2022) ☆214 · Updated 2 years ago