sierra-research / tau-bench
Code and Data for Tau-Bench
☆272 Updated 3 weeks ago
Alternatives and similar repositories for tau-bench:
Users who are interested in tau-bench are comparing it to the repositories listed below.
- AWM: Agent Workflow Memory ☆241 Updated 3 weeks ago
- ☆362 Updated last month
- ☆349 Updated 2 weeks ago
- ☆156 Updated 6 months ago
- AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and re… ☆235 Updated this week
- TapeAgents is a framework that facilitates all stages of the LLM Agent development lifecycle ☆207 Updated this week
- Code for the paper 🌳 Tree Search for Language Model Agents ☆178 Updated 6 months ago
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap… ☆145 Updated 2 months ago
- An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral] ☆281 Updated 9 months ago
- 🤠 Agent-as-a-Judge and DevAI dataset ☆322 Updated last month
- Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding ☆369 Updated last year
- ☆574 Updated last month
- Automatic Evals for LLMs ☆266 Updated this week
- Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents" ☆866 Updated 2 weeks ago
- A comprehensive repository of reasoning tasks for LLMs (and beyond) ☆411 Updated 4 months ago
- Official Repo for ICML 2024 paper "Executable Code Actions Elicit Better LLM Agents" by Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhan… ☆600 Updated 8 months ago
- VisualWebArena is a benchmark for multimodal agents. ☆295 Updated 3 months ago
- An agent benchmark with tasks in a simulated software company. ☆243 Updated this week
- Code for Husky, an open-source language agent that solves complex, multi-step reasoning tasks. Husky v1 addresses numerical, tabular and … ☆336 Updated 8 months ago
- ☆438 Updated 4 months ago
- Code and data for "Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs" ☆462 Updated 11 months ago
- A simple unified framework for evaluating LLMs ☆197 Updated 2 weeks ago
- WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks? ☆160 Updated this week
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆616 Updated last month
- 🌎💪 BrowserGym, a Gym environment for web task automation ☆527 Updated 2 weeks ago
- [NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898 ☆208 Updated 9 months ago
- Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym ☆354 Updated last month
- [NeurIPS 2022] 🛒WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents ☆305 Updated 5 months ago
- A compilation of the best multi-agent papers ☆379 Updated last week
- Search-o1: Agentic Search-Enhanced Large Reasoning Models ☆628 Updated last week