StonyBrookNLP / appworld
AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agent, ACL'24 Best Resource Paper.
☆346 Updated last month
Alternatives and similar repositories for appworld
Users who are interested in appworld are comparing it to the libraries listed below.
- Benchmark and research code for the paper SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks ☆254 Updated 7 months ago
- An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral] ☆375 Updated last year
- Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike stat… ☆405 Updated last month
- A benchmark list for evaluation of large language models. ☆153 Updated 3 months ago
- AWM: Agent Workflow Memory ☆372 Updated 10 months ago
- Framework and toolkits for building and evaluating collaborative agents that can work together with humans. ☆113 Updated 3 weeks ago
- Official Implementation of Dynamic LLM-Agent Network: An LLM-agent Collaboration Framework with Agent Team Optimization ☆189 Updated last year
- A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench. ☆202 Updated 8 months ago
- [NeurIPS 2024] Agent Planning with World Knowledge Model ☆158 Updated last year
- Code for the paper Tree Search for Language Model Agents ☆216 Updated last year
- Resources for our paper: "Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training" ☆165 Updated 2 months ago
- ☆328 Updated 6 months ago
- ☆242 Updated last year
- ☆218 Updated 8 months ago
- Building Open LLM Web Agents with Self-Evolving Online Curriculum RL ☆486 Updated 6 months ago
- Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference) ☆159 Updated last year
- ☆205 Updated last month
- [ICLR 2025] Benchmarking Agentic Workflow Generation ☆140 Updated 10 months ago
- [NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents ☆131 Updated 8 months ago
- An Illusion of Progress? Assessing the Current State of Web Agents ☆125 Updated last week
- ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - NeurIPS 2023 (oral) ☆266 Updated last year
- A Framework for LLM-based Multi-Agent Reinforced Training and Inference ☆377 Updated last month
- Code for Paper: Autonomous Evaluation and Refinement of Digital Agents [COLM 2024] ☆147 Updated last year
- [COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents ☆217 Updated 5 months ago
- This is the repository that contains the source code for the Self-Evaluation Guided MCTS for online DPO. ☆326 Updated last year
- Official repository for ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning" ☆181 Updated 7 months ago
- ☆213 Updated 6 months ago
- [NeurIPS 2022] WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents ☆446 Updated last year
- A Comprehensive Benchmark for Software Development. ☆124 Updated last year
- A lightweight reproduction of DeepSeek-R1-Zero with in-depth analysis of self-reflection behavior. ☆249 Updated 8 months ago