StonyBrookNLP / appworld
🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents", ACL'24 Best Resource Paper.
☆111 Updated last month
Related projects
Alternatives and complementary repositories for appworld
- Benchmarking LLMs with Challenging Tasks from Real Users ☆198 Updated 2 weeks ago
- Code for the paper 🌳 Tree Search for Language Model Agents ☆140 Updated 3 months ago
- Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha… ☆104 Updated 5 months ago
- Code for Paper: Autonomous Evaluation and Refinement of Digital Agents [COLM 2024] ☆97 Updated last month
- ☆112 Updated last month
- [ACL'24] Code and data of paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator" ☆49 Updated 9 months ago
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆124 Updated 3 weeks ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆41 Updated 9 months ago
- Self-Alignment with Principle-Following Reward Models ☆147 Updated 8 months ago
- Sotopia: an Open-ended Social Learning Environment (ICLR 2024 spotlight) ☆165 Updated this week
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore". ☆130 Updated this week
- ☆116 Updated 5 months ago
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' ☆162 Updated last month
- Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024] ☆127 Updated 2 months ago
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆146 Updated 3 weeks ago
- ☆103 Updated last month
- ☆192 Updated 3 months ago
- An Analytical Evaluation Board of Multi-turn LLM Agents ☆250 Updated 6 months ago
- Code and example data for the paper: Rule Based Rewards for Language Model Safety ☆158 Updated 4 months ago
- This is the official repository of the paper "OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI" ☆86 Updated last month
- Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference) ☆99 Updated 3 weeks ago
- [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following ☆118 Updated 4 months ago
- Repository for the paper Stream of Search: Learning to Search in Language ☆93 Updated 3 months ago
- A set of utilities for running few-shot prompting experiments on large-language models ☆113 Updated last year
- [NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898 ☆194 Updated 6 months ago
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆128 Updated last month
- ToolBench, an evaluation suite for LLM tool manipulation capabilities. ☆145 Updated 8 months ago
- Resources for our paper: "EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms" ☆75 Updated last month
- ☆90 Updated 4 months ago
- [ACL 2024] AUTOACT: Automatic Agent Learning from Scratch for QA via Self-Planning ☆179 Updated last month