eval-sys / mcpmark

MCPMark is a comprehensive, stress-testing MCP benchmark designed to evaluate model and agent capabilities in real-world MCP use.

☆382

Alternatives and similar repositories for mcpmark

Users who are interested in mcpmark are comparing it to the repositories listed below.

TheAgentArk / Toucan
Official repo of Toucan: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments
☆223 Updated last month
meituan-longcat / LongCat-Flash-Thinking-2601
☆239 Updated this week
antgroup / Research-Venus
Deep Research
☆303 Updated 5 months ago
llm-in-sandbox / llm-in-sandbox
LLM-in-Sandbox Elicits General Agentic Intelligence
☆167 Updated last week
sail-sg / FlowReasoner
☆144 Updated 9 months ago
AQ-MedAI / MrlX
MrlX: A Multi-Agent Reinforcement Learning Framework
☆189 Updated 2 weeks ago
OpenBMB / AgentCPM
An End-to-End Infrastructure for Training and Evaluating Various LLM Agents
☆708 Updated this week
neulab / agent-data-protocol
☆131 Updated last month
lmarena / p2l
Prompt-to-Leaderboard
☆271 Updated 8 months ago
meituan-longcat / UNO-Bench
Omni Model Benchmark with high quality and diversity, which reveals the Compositional Law. We’re now focused on Chinese scenarios — and a…
☆74 Updated 3 weeks ago
modelscope / MCPBench
The evaluation benchmark on MCP servers
☆238 Updated 5 months ago
Gen-Verse / ScoreFlow
ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization
☆95 Updated 8 months ago
THUDM / DeepDive
DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
☆241 Updated 4 months ago
OPPO-PersonalAI / OAgents
Implementation for OAgents: An Empirical Study of Building Effective Agents
☆306 Updated 3 months ago
facebookresearch / sweet_rl
Benchmark and research code for the paper SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks
☆260 Updated 9 months ago
GavinZhengOI / LiveCodeBench-Pro
☆165 Updated last month
VectorSpaceLab / Infomatica
Data Synthesis for Deep Research Based on Semi-Structured Data
☆197 Updated last month
hkust-nlp / Toolathlon
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
☆217 Updated this week
Bowen1911 / LLMs-Distillation-Quantification
Repo of ACL 2025 Paper "Quantification of Large Language Model Distillation"
☆93 Updated 6 months ago
zorazrw / agent-workflow-memory
AWM: Agent Workflow Memory
☆389 Updated last month
ByteDance-Seed / WideSearch
WideSearch: Benchmarking Agentic Broad Info-Seeking
☆118 Updated 3 months ago
hzy312 / knowledge-r1
IKEA: Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent
☆68 Updated 8 months ago
OSU-NLP-Group / SkillWeaver
SkillWeaver is a framework to enable web agent self-improvement through environment exploration and skill synthesis.
☆108 Updated 9 months ago
TsinghuaC3I / SSRL
SSRL: Self-Search Reinforcement Learning
☆206 Updated 5 months ago
zou-group / sirius
SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning
☆94 Updated 2 months ago
bingreeky / GMemory
☆192 Updated 3 months ago
microsoft / SWE-bench-Live
[NeurIPS 2025 D&B] 🚀 SWE-bench Goes Live!
☆161 Updated last week
Gen-Verse / CURE
[NeurIPS 2025 Spotlight] Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning
☆149 Updated 4 months ago
nemori-ai / nemori
A minimalist MVP demonstrating a simple yet profound insight: aligning AI memory with human episodic memory granularity. Shows how this s…
☆161 Updated last month
FreedomIntelligence / RAG-Instruct
[EMNLP 2025] RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions
☆136 Updated 9 months ago