xbench-ai/xbench-evals

Evergreen, contamination-free, real-world, domain-specific AI evaluation framework

☆139

Alternatives and similar repositories for xbench-evals

Users who are interested in xbench-evals are comparing it to the repositories listed below.

PALIN2018 / BrowseComp-ZH
☆158 · May 14, 2025 · Updated last year
RUCAIBox / SimpleDeepSearcher
SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis
☆120 · Jun 3, 2025 · Updated last year
Ayanami0730 / deep_research_bench
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
☆792 · May 11, 2026 · Updated 2 months ago
GAIR-NLP / DeepResearcher
Scaling Deep Research via Reinforcement Learning in Real-world Environments.
☆781 · May 10, 2026 · Updated 2 months ago
ByteDance-Seed / WideSearch
WideSearch: Benchmarking Agentic Broad Info-Seeking
☆147 · Oct 9, 2025 · Updated 9 months ago
RedSearchAgent / DeepTraceHub
RedSearcher's framework for deep search agent trajectory synthesis, QA filtering, and model evaluation, supporting ReACT and DeepSeek-sty…
☆23 · Feb 26, 2026 · Updated 4 months ago
MiroMindAI / MiroRL
MiroRL is an MCP-first reinforcement learning framework for deep research agent.
☆246 · Aug 27, 2025 · Updated 10 months ago
inclusionAI / ASearcher
An Open-Source Large-Scale Reinforcement Learning Project for Search Agents
☆602 · Nov 26, 2025 · Updated 7 months ago
texttron / BrowseComp-Plus
BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent (ACL 2026 Main)
☆316 · May 28, 2026 · Updated last month
xbench-ai / AgentIF-OneDay
☆36 · Mar 23, 2026 · Updated 3 months ago
benchflow-ai / env0
☆16 · Jul 11, 2026 · Updated last week
qhjqhj00 / awesome-agentic-search
🔍 Awesome Agentic Search is a curated list of papers, tools, and resources on agentic search—where AI agents plan, search, and reason to…
☆60 · Aug 28, 2025 · Updated 10 months ago
lblankl / Short-RL
Short RL
☆19 · Apr 16, 2026 · Updated 3 months ago
vickywu1022 / OntoProbe-PLMs
Repo for outstanding paper@ACL 2023 "Do PLMs Know and Understand Ontological Knowledge?"
☆33 · Oct 16, 2023 · Updated 2 years ago
ai-agents-2030 / awesome-deep-research-agent
☆624 · Sep 18, 2025 · Updated 10 months ago
PeterGriffinJin / Search-R1
Search-R1: An Efficient, Scalable RL Training Framework for Reasoning & Search Engine Calling interleaved LLM based on veRL
☆5,123 · Nov 13, 2025 · Updated 8 months ago
benchflow-ai / ClawsBench
Repository for results and data (coming soon!) for ClawsBench
☆30 · Apr 8, 2026 · Updated 3 months ago
HKUDS / DeepResearch-Eval
"DeepResearch-Eval: An End-to-End Evaluation Framework for DeepResearch Systems"
☆49 · Oct 16, 2025 · Updated 9 months ago
DavidZWZ / Awesome-Deep-Research
[ACL 2026 KnowFM] Awesome Agentic Deep Research Resources
☆807 · Jul 12, 2026 · Updated last week
RedSearchAgent / REDSearcher
REDSearch: A scalable, cost-efficient framework for long-horizon search agents. Features complex task synthesis, optimized mid-training, …
☆128 · Feb 26, 2026 · Updated 4 months ago
NovaSky-AI / SkyRL-OpenHands
☆37 · Nov 26, 2025 · Updated 7 months ago
LivingFutureLab / ChineseSimpleQA
☆79 · Jan 24, 2025 · Updated last year
xspadex / bilibili-mcp
☆14 · Apr 23, 2025 · Updated last year
OPPO-PersonalAI / Flash-Searcher
Official Implementation of Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution
☆88 · Dec 8, 2025 · Updated 7 months ago
RUCAIBox / R1-Searcher
R1-searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
☆720 · Aug 5, 2025 · Updated 11 months ago
sierra-research / tau-bench
Code and Data for Tau-Bench
☆1,337 · Mar 18, 2026 · Updated 4 months ago
SUFE-AIFLM-Lab / FinGAIA
☆24 · Oct 29, 2025 · Updated 8 months ago
LuLuLuyi / R-HORIZON
[ICLR'2026] R-HORIZON: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
☆18 · Oct 21, 2025 · Updated 9 months ago
RUCAIBox / OlymMATH
The OlymMATH dataset
☆24 · Jun 1, 2025 · Updated last year
hkust-nlp / Toolathlon
[ICLR 2026] The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
☆430 · Updated this week
AgentR1 / Agent-R1
Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning
☆1,551 · Jul 13, 2026 · Updated last week
harbor-framework / terminal-bench
A benchmark for LLMs on complicated tasks in the terminal
☆2,467 · Jul 11, 2026 · Updated last week
NJU-LINK / DR3-Eval
☆38 · May 7, 2026 · Updated 2 months ago
Simple-Efficient / RL-Factory
Train your Agent model via our easy and efficient framework
☆1,773 · Dec 5, 2025 · Updated 7 months ago
OPPO-PersonalAI / FINDER_DEFT
Official implementation for paper "How Far Are We from Genuinely Useful Deep Research Agents?"
☆65 · Dec 10, 2025 · Updated 7 months ago
thunlp / AutoForm
Code for paper "Beyond Natural Language: LLMs Leveraging Alternative Formats for Enhanced Reasoning and Communication"
☆23 · Mar 30, 2024 · Updated 2 years ago
OSU-NLP-Group / Mind2Web-2
[NeurIPS'25 D&B] Mind2Web-2 Benchmark: Evaluating Agentic Search with Agent-as-a-Judge
☆111 · May 17, 2026 · Updated 2 months ago
cxcscmu / deepresearch_benchmarking
☆29 · Mar 10, 2026 · Updated 4 months ago
mcp-tool-bench / MCPToolBenchPP
MCPToolBench++ MCP Model Context Protocol Tool Use Benchmark on AI Agent and Model Tool Use Ability
☆44 · Mar 17, 2026 · Updated 4 months ago