sierra-research / tau-bench
Code and Data for Tau-Bench
☆201 Updated 3 weeks ago
Related projects
Alternatives and complementary repositories for tau-bench
- ☆127 Updated 3 months ago
- AWM: Agent Workflow Memory ☆205 Updated last month
- An Analytical Evaluation Board of Multi-turn LLM Agents ☆250 Updated 6 months ago
- ☆316 Updated last month
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆212 Updated last month
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap… ☆110 Updated 3 weeks ago
- Code for Husky, an open-source language agent that solves complex, multi-step reasoning tasks. Husky v1 addresses numerical, tabular and … ☆328 Updated 5 months ago
- Code and data for "Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs" ☆448 Updated 8 months ago
- Code for the paper 🌳 Tree Search for Language Model Agents ☆138 Updated 3 months ago
- Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding ☆327 Updated 9 months ago
- Official Repo for ICML 2024 paper "Executable Code Actions Elicit Better LLM Agents" by Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhan… ☆496 Updated 5 months ago
- A simple unified framework for evaluating LLMs ☆145 Updated last week
- ☆526 Updated last month
- TapeAgents is a framework that facilitates all stages of the LLM Agent development lifecycle ☆124 Updated this week
- Implementation of Google's SELF-DISCOVER ☆282 Updated 3 months ago
- FireAct: Toward Language Agent Fine-tuning ☆255 Updated last year
- Benchmarking LLMs with Challenging Tasks from Real Users ☆195 Updated 2 weeks ago
- VisualWebArena is a benchmark for multimodal agents. ☆244 Updated last week
- ☆282 Updated 7 months ago
- OS-ATLAS: A Foundation Action Model For Generalist GUI Agents ☆166 Updated this week
- RewardBench: the first evaluation tool for reward models. ☆431 Updated 3 weeks ago
- ☆116 Updated 5 months ago
- Data and code for FreshLLMs (https://arxiv.org/abs/2310.03214) ☆328 Updated this week
- ☆103 Updated 3 months ago
- [NeurIPS 2022] 🛒WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents ☆276 Updated 2 months ago
- Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents" ☆753 Updated last month
- This is the official repo for "PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization". PromptAgen… ☆204 Updated 3 months ago
- 🤠 Agent-as-a-Judge and DevAI dataset ☆192 Updated this week
- Building Open LLM Web Agents with Self-Evolving Online Curriculum RL ☆204 Updated this week
- MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering ☆517 Updated 2 weeks ago