sierra-research/tau2-bench

τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

☆1,226

Alternatives and similar repositories for tau2-bench

Users who are interested in tau2-bench are comparing it to the repositories listed below.

sierra-research / tau-bench
Code and Data for Tau-Bench
☆1,246 · Mar 18, 2026 · Updated 2 months ago
chenchen0103 / ACEBench
☆179 · Oct 29, 2025 · Updated 7 months ago
futuredialchallenge / 2024-RAG
A Challenge on Dialog Systems with Retrieval Augmented Generation (FutureDial-RAG), Co-located with SLT2024 FutureDial-RAG Challenge
☆11 · Aug 10, 2024 · Updated last year
THUNLP-MT / StableToolBench
A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench.
☆234 · Apr 15, 2025 · Updated last year
zai-org / ComplexFuncBench
Complex Function Calling Benchmark.
☆178 · Jan 20, 2025 · Updated last year
PeterGriffinJin / Search-R1
Search-R1: An Efficient, Scalable RL Training Framework for Reasoning & Search Engine Calling interleaved LLM based on veRL
☆4,753 · Nov 13, 2025 · Updated 6 months ago
NVIDIA / When2Call
A dataset for training and evaluating LLMs on decision making about "when (not) to call" functions
☆63 · Apr 29, 2025 · Updated last year
verl-project / verl
verl/HybridFlow: A Flexible and Efficient RL Post-Training Framework
☆21,514 · Updated this week
SalesforceAIResearch / xLAM
xLAM: A Family of Large Action Models to Empower AI Agent Systems
☆619 · Aug 21, 2025 · Updated 9 months ago
NovaSky-AI / SkyRL
SkyRL: A Modular Full-stack RL Library for LLMs
☆1,894 · Updated this week
bytedance / FTRL
[ACL 2026] Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments
☆52 · Apr 6, 2026 · Updated last month
openai / simple-evals
☆4,492 · Apr 22, 2026 · Updated last month
SWE-Gym / SWE-Gym
Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025]
☆679 · Jul 29, 2025 · Updated 10 months ago
THUDM / AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
☆3,444 · Feb 8, 2026 · Updated 3 months ago
Berkeley-NLP / Agent-Eval-Refine
Code for Paper: Autonomous Evaluation and Refinement of Digital Agents [COLM 2024]
☆149 · Nov 26, 2024 · Updated last year
openai / mle-bench
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
☆1,549 · Apr 24, 2026 · Updated last month
open-compass / T-Eval
[ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
☆306 · Apr 3, 2024 · Updated 2 years ago
hkust-nlp / simpleRL-reason
Simple RL training for reasoning
☆3,859 · Dec 23, 2025 · Updated 5 months ago
microsoft / acon
Official implementation of paper "ACON: Optimizing Context Compression for Long-horizon LLM Agents"
☆80 · Oct 14, 2025 · Updated 7 months ago
GAIR-NLP / ToRL
☆346 · May 24, 2025 · Updated last year
OpenRLHF / OpenRLHF
An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & VLM & TIS & vLLM & Ray & Asy…
☆9,548 · Updated this week
cmu-l3 / gym-anything
Gym-Anything: Turn any Software into an Agent Environment
☆234 · May 18, 2026 · Updated last week
huggingface / lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
☆2,426 · May 21, 2026 · Updated last week
zjunlp / OneEdit
OneEdit: A Neural-Symbolic Collaboratively Knowledge Editing System.
☆20 · Oct 14, 2024 · Updated last year
hhan1018 / NesTools
[COLING 2025] NesTools: A Dataset for Evaluating Nested Tool Learning Abilities of Large Language Models
☆18 · Jan 18, 2025 · Updated last year
StonyBrookNLP / appworld
🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agent, ACL'24 Best Resource…
☆421 · Feb 17, 2026 · Updated 3 months ago
TsinghuaC3I / Awesome-RL-for-LRMs
A Survey of Reinforcement Learning for Large Reasoning Models
☆2,459 · Nov 9, 2025 · Updated 6 months ago
langfengQ / verl-agent
verl-agent is an extension of veRL, designed for training LLM/VLM agents via RL. verl-agent is also the official code for paper "Group-in…
☆1,944 · Updated this week
Tencent-Hunyuan / C3-Benchmark
C^3-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking
☆38 · Mar 1, 2026 · Updated 2 months ago
sail-sg / understand-r1-zero
Understanding R1-Zero-Like Training: A Critical Perspective
☆1,259 · Aug 27, 2025 · Updated 9 months ago
hkust-nlp / AgentBoard
An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral]
☆416 · May 20, 2024 · Updated 2 years ago
THUDM / ReST-MCTS
ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search (NeurIPS 2024)
☆705 · Jan 20, 2025 · Updated last year
allenai / open-instruct
AllenAI's post-training codebase
☆3,729 · Updated this week
willccbb / localchat
☆14 · Apr 16, 2025 · Updated last year
OSU-NLP-Group / TravelPlanner
[ICML'24 Spotlight] "TravelPlanner: A Benchmark for Real-World Planning with Language Agents"
☆515 · Updated this week
Junjie-Ye / ToolEyes
[COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
☆74 · May 13, 2025 · Updated last year
InternLM / SWE-Fixer
☆139 · May 8, 2025 · Updated last year
facebookresearch / meta-agents-research-environments
Meta Agents Research Environments is a comprehensive platform designed to evaluate AI agents in dynamic, realistic scenarios. Unlike stat…
☆497 · May 21, 2026 · Updated last week
princeton-nlp / SimPO
[NeurIPS 2024] SimPO: Simple Preference Optimization with a Reference-Free Reward
☆954 · Feb 16, 2025 · Updated last year