StonyBrookNLP / appworld
Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Paper.
☆250 · Updated last month
Alternatives and similar repositories for appworld
Users who are interested in appworld are comparing it to the repositories listed below.
- Benchmark and research code for the paper SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks ☆245 · Updated 5 months ago
- Code for the paper 🌳 Tree Search for Language Model Agents ☆217 · Updated last year
- An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral] ☆354 · Updated last year
- A benchmark list for evaluation of large language models. ☆143 · Updated 3 weeks ago
- Framework and toolkits for building and evaluating collaborative agents that can work together with humans. ☆99 · Updated this week
- [NeurIPS 2024] Agent Planning with World Knowledge Model ☆149 · Updated 9 months ago
- AWM: Agent Workflow Memory ☆326 · Updated 8 months ago
- [COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ☆166 · Updated 2 months ago
- ☆240 · Updated last year
- Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference) ☆151 · Updated 11 months ago
- Code for Paper: Autonomous Evaluation and Refinement of Digital Agents [COLM 2024] ☆144 · Updated 10 months ago
- Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike stat… ☆263 · Updated last week
- An Illusion of Progress? Assessing the Current State of Web Agents ☆88 · Updated 2 months ago
- [NeurIPS 2022] WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents ☆401 · Updated last year
- Official Implementation of Dynamic LLM-Agent Network: An LLM-agent Collaboration Framework with Agent Team Optimization ☆169 · Updated last year
- Code and example data for the paper: Rule Based Rewards for Language Model Safety ☆198 · Updated last year
- A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench. ☆184 · Updated 5 months ago
- VisualWebArena is a benchmark for multimodal agents. ☆386 · Updated 10 months ago
- "Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents" ☆83 · Updated last week
- This repository contains an LLM benchmark for the social deduction game "Resistance Avalon" ☆126 · Updated 4 months ago
- [ACL 2024] AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning ☆230 · Updated 8 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users ☆241 · Updated 11 months ago
- [ICLR 2024] MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use ☆96 · Updated last year
- augmented LLM with self reflection ☆133 · Updated last year
- An extensible benchmark for evaluating large language models on planning ☆410 · Updated 2 weeks ago
- ☆318 · Updated 4 months ago
- Research Code for "ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL" ☆193 · Updated 5 months ago
- Sotopia: an Open-ended Social Learning Environment (ICLR 2024 spotlight) ☆246 · Updated 2 weeks ago
- [ICLR 2025] Benchmarking Agentic Workflow Generation ☆129 · Updated 7 months ago
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆212 · Updated 3 months ago