eigent-ai/toolathlon_gym

Toolathlon-Gym for testing AI agents' real-world tool-use capabilities across diverse MCP servers.

☆138

Alternatives and similar repositories for toolathlon_gym

Users who are interested in toolathlon_gym are comparing it to the repositories listed below.

hkust-nlp / Toolathlon
[ICLR 2026] The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
☆430 · Updated this week
sheep333c / DIVE
DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use
☆26 · Mar 13, 2026 · Updated 4 months ago
hkust-nlp / LOCA-bench
Benchmarking Language Agents Under Controllable and Extreme Context Growth
☆50 · Apr 29, 2026 · Updated 2 months ago
scaleapi / mcp-atlas
MCP Atlas
☆120 · Updated this week
scaleapi / SWE-Interact
New testbed of interactive SWE tasks for coding agents, set in a realistic, multi-turn, developer-driven environment
☆21 · Jun 30, 2026 · Updated 2 weeks ago
RUC-NLPIR / EnvScaler
The official implementation of "EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis".
☆175 · Feb 12, 2026 · Updated 5 months ago
lukahhcm / Awesome_Environment_Scaling
Resources and paper list for 'Scaling Environments for Agents'. This repository accompanies our survey on how environments contribute to …
☆71 · Jan 28, 2026 · Updated 5 months ago
aisa-group / PostTrainBench
Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours
☆462 · Updated this week
Gen-Verse / GenEnv
GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators
☆62 · Dec 23, 2025 · Updated 6 months ago
OpenRewardAI / openreward-cookbook
Training and evaluating with OpenReward
☆33 · Apr 28, 2026 · Updated 2 months ago
zapier / AutomationBench
A benchmark for evaluating AI agents on realistic business workflows
☆136 · Updated this week
axon-rl / gem
A Gym for Agentic LLMs
☆502 · Jan 21, 2026 · Updated 6 months ago
hkust-nlp / deepsearch-tts
Pushing Test-Time Scaling Limits of Deep Search with Asymmetric Verification
☆21 · Oct 8, 2025 · Updated 9 months ago
hkust-nlp / AgentVista
Benchmarking multimodal agents on realistic, ultra-challenging visual scenarios requiring long-horizon hybrid tool use.
☆65 · Mar 10, 2026 · Updated 4 months ago
Job-Bench / job-bench-eval
Official eval scripts for JobBench
☆26 · Updated this week
eval-sys / mcpmark
MCPMark is a comprehensive, stress-testing MCP benchmark designed to evaluate model and agent capabilities in real-world MCP use.
☆449 · Jun 12, 2026 · Updated last month
Fu-Dayuan / AgentRefine
(ICLR 2025) AgentRefine: Enhancing Agent Generalization through Refinement Tuning
☆20 · Nov 22, 2025 · Updated 7 months ago
GAIR-NLP / daVinci-Agency
daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently
☆38 · Feb 4, 2026 · Updated 5 months ago
SalesforceAIResearch / UserBench
☆63 · Jun 2, 2026 · Updated last month
ServiceNow / EnterpriseOps-Gym
Codebase for EnterpriseOps-Gym from ServiceNow
☆109 · Jul 5, 2026 · Updated 2 weeks ago
sierra-research / tau2-bench
τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
☆1,622 · Updated this week
harbor-framework / harbor
Framework for evaluating and improving agents
☆3,320 · Updated this week
hkust-nlp / model-task-align-rl
[ICLR 26] The official code repository for the paper "Mirage or Method? How Model–Task Alignment Induces Divergent RL Conclusions".
☆18 · Feb 9, 2026 · Updated 5 months ago
Interplay-LM-Reasoning / Interplay-LM-Reasoning
[ICML 2026 Spotlight] On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
☆162 · Jun 8, 2026 · Updated last month
ByteDance-Seed / EdgeBench
EdgeBench: Unveiling scaling laws of learning from real-world environments
☆365 · Updated this week
NovaSky-AI / SkyRL
SkyRL: A Modular Full-stack RL Library for LLMs
☆2,081 · Updated this week
TIGER-AI-Lab / verl-tool
A version of verl to support diverse tool use [TMLR 2026]
☆1,020 · Updated this week
Zhiyuan-Zeng / RLVE
[ICML 2026] RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments
☆223 · Apr 30, 2026 · Updated 2 months ago
web-arena-x / webarena-infinity
An approach to automatically generating browser environments with verifiable tasks
☆63 · Mar 24, 2026 · Updated 3 months ago
meituan-longcat / vitabench
[ICLR 2026] VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
☆157 · Feb 22, 2026 · Updated 4 months ago
RUC-NLPIR / ET-Agent
☆20 · Jan 18, 2026 · Updated 6 months ago
OpenDataBox / Workspace-Bench
Benchmark for self-evolving agents on realistic large-scale file workspaces
☆43 · Updated this week
camel-ai / seta
💻 SETA: Scaling Environments for Terminal Agents
☆124 · Updated this week
cmu-l3 / gym-anything
Gym-Anything: Turn any Software into an Agent Environment
☆262 · Jul 14, 2026 · Updated last week
dengmengjie / ToolScope
Official repository for ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use
☆31 · Nov 4, 2025 · Updated 8 months ago
microsoft / Simia-Agent-Training
Official Implementation of "Simulating Environments with Reasoning Models for Agent Training"
☆65 · Feb 18, 2026 · Updated 5 months ago
PrimeIntellect-ai / research-environments
Environments by the Prime Intellect Research Team
☆77 · Updated this week
princeton-nlp / benign-data-breaks-safety
☆47 · Oct 1, 2024 · Updated last year
open-thoughts / OpenThoughts-Agent
Data recipes and robust infrastructure for training AI agents
☆260 · Updated this week