StonyBrookNLP / appworld
AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agents, ACL'24 Best Resource Paper.
⭐ 307 · Updated this week
Alternatives and similar repositories for appworld
Users interested in appworld are comparing it to the repositories listed below.
- Benchmark and research code for the paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks" ⭐ 249 · Updated 6 months ago
- An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral] ⭐ 360 · Updated last year
- AWM: Agent Workflow Memory ⭐ 353 · Updated 9 months ago
- Code for the paper "Tree Search for Language Model Agents" ⭐ 217 · Updated last year
- Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike stat… ⭐ 348 · Updated this week
- Code for the paper "Autonomous Evaluation and Refinement of Digital Agents" [COLM 2024] ⭐ 147 · Updated 11 months ago
- ⭐ 239 · Updated last year
- Framework and toolkits for building and evaluating collaborative agents that can work together with humans. ⭐ 107 · Updated 2 weeks ago
- An Illusion of Progress? Assessing the Current State of Web Agents ⭐ 107 · Updated last week
- A benchmark list for evaluating large language models. ⭐ 149 · Updated 2 months ago
- Building Open LLM Web Agents with Self-Evolving Online Curriculum RL ⭐ 472 · Updated 5 months ago
- [NeurIPS 2024] Agent Planning with World Knowledge Model ⭐ 154 · Updated 10 months ago
- [TMLR'25] "Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents" ⭐ 91 · Updated last month
- Code and example data for the paper "Rule Based Rewards for Language Model Safety" ⭐ 202 · Updated last year
- ⭐ 326 · Updated 5 months ago
- A new tool-learning benchmark aiming at well-balanced stability and reality, based on ToolBench. ⭐ 193 · Updated 7 months ago
- Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference) ⭐ 152 · Updated last year
- ⭐ 215 · Updated 7 months ago
- [COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ⭐ 185 · Updated 4 months ago
- ⭐ 116 · Updated 9 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users ⭐ 244 · Updated last year
- [NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents ⭐ 128 · Updated 7 months ago
- A simple unified framework for evaluating LLMs ⭐ 254 · Updated 7 months ago
- [NeurIPS 2022] WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents ⭐ 428 · Updated last year
- [ICLR 2025] Benchmarking Agentic Workflow Generation ⭐ 132 · Updated 8 months ago
- An extensible benchmark for evaluating large language models on planning ⭐ 429 · Updated last month
- FireAct: Toward Language Agent Fine-tuning ⭐ 284 · Updated 2 years ago
- Code for "STaR: Bootstrapping Reasoning With Reasoning" (NeurIPS 2022) ⭐ 218 · Updated 2 years ago
- Official repo for the ICLR 2024 paper "MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback" by Xingyao Wang*, Ziha… ⭐ 133 · Updated last year
- Towards Large Multimodal Models as Visual Foundation Agents ⭐ 242 · Updated 6 months ago