sierra-research / tau-bench

Code and Data for Tau-Bench

☆713

Alternatives and similar repositories for tau-bench

Users who are interested in tau-bench compare it to the repositories listed below.

SalesforceAIResearch / xLAM
xLAM: A Family of Large Action Models to Empower AI Agent Systems
☆507 Updated last week
openai / mle-bench
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
☆815 Updated last month
facebookresearch / swe-rl
Official codebase for "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution"
☆571 Updated 4 months ago
mlfoundations / evalchemy
Automatic evals for LLMs
☆488 Updated last month
web-arena-x / webarena
Code repo for "WebArena: A Realistic Web Environment for Building Autonomous Agents"
☆1,075 Updated 5 months ago
zorazrw / agent-workflow-memory
AWM: Agent Workflow Memory
☆297 Updated 6 months ago
ServiceNow / AgentLab
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and re…
☆372 Updated this week
WecoAI / aideml
AIDE: AI-Driven Exploration in the Space of Code. The machine learning engineering agent that automates AI R&D.
☆972 Updated this week
aorwall / moatless-tools
☆518 Updated last month
SWE-Gym / SWE-Gym
Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025]
☆513 Updated this week
TheAgentCompany / TheAgentCompany
An agent benchmark with tasks in a simulated software company.
☆509 Updated this week
LiveCodeBench / LiveCodeBench
Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"
☆608 Updated 2 weeks ago
princeton-nlp / WebShop
[NeurIPS 2022] 🛒WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
☆379 Updated 10 months ago
SalesforceAIResearch / AgentLite
☆616 Updated 6 months ago
multi-agent-systems-failure-taxonomy / MAST
☆240 Updated last week
zhentingqi / rStar
☆953 Updated 6 months ago
apple / ToolSandbox
☆191 Updated 11 months ago
StonyBrookNLP / appworld
🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap…
☆231 Updated 2 months ago
ezelikman / quiet-star
Code for Quiet-STaR
☆735 Updated 11 months ago
lapisrocks / LanguageAgentTreeSearch
[ICML 2024] Official repository for "Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models"
☆766 Updated last year
madaan / self-refine
LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively.
☆719 Updated 9 months ago
SWE-bench / SWE-smith
Scaling Data for SWE-agents
☆328 Updated this week
magpie-align / magpie
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data …
☆736 Updated 4 months ago
allenai / reward-bench
RewardBench: the first evaluation tool for reward models.
☆619 Updated last month
google-deepmind / long-form-factuality
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
☆627 Updated 2 weeks ago
tencent-ailab / persona-hub
Official repo for the paper "Scaling Synthetic Data Creation with 1,000,000,000 Personas"
☆1,249 Updated 5 months ago
trotsky1997 / MathBlackBox
☆1,028 Updated 7 months ago
xingyaoww / code-act
Official Repo for ICML 2024 paper "Executable Code Actions Elicit Better LLM Agents" by Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhan…
☆1,310 Updated last year
NovaSky-AI / SkyRL
SkyRL: A Modular Full-stack RL Library for LLMs
☆679 Updated this week
hkust-nlp / AgentBoard
An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral]
☆332 Updated last year