eval-sys / mcpmark
MCPMark is a comprehensive, stress-testing MCP benchmark designed to evaluate model and agent capabilities in real-world MCP use.
☆378 · Updated last week
Alternatives and similar repositories for mcpmark
Users who are interested in mcpmark are comparing it to the repositories listed below
- Official repo of Toucan: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments ☆223 · Updated last month
- ☆229 · Updated 2 weeks ago
- The evaluation benchmark on MCP servers ☆238 · Updated 5 months ago
- Deep Research ☆303 · Updated 5 months ago
- A minimalist MVP demonstrating a simple yet profound insight: aligning AI memory with human episodic memory granularity. Shows how this s… ☆161 · Updated last month
- ☆153 · Updated last week
- An End-to-End Infrastructure for Training and Evaluating Various LLM Agents ☆708 · Updated this week
- MrlX: A Multi-Agent Reinforcement Learning Framework ☆189 · Updated 2 weeks ago
- Implementation for OAgents: An Empirical Study of Building Effective Agents ☆306 · Updated 3 months ago
- ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization ☆95 · Updated 8 months ago
- ☆40 · Updated 5 months ago
- ☆131 · Updated last month
- ☆192 · Updated 3 months ago
- ☆165 · Updated last month
- SWE-Swiss: A Multi-Task Fine-Tuning and RL Recipe for High-Performance Issue Resolution ☆104 · Updated 4 months ago
- [ICML 2025] ResearchTown: Simulator of Human Research Community ☆192 · Updated this week
- [NeurIPS 2024] Personal Agentic AI for MultiAgent Cooperation ☆87 · Updated last year
- SkillWeaver is a framework to enable web agent self-improvement through environment exploration and skill synthesis. ☆108 · Updated 9 months ago
- ☆131 · Updated 8 months ago
- [NeurIPS'25 D&B] Mind2Web-2 Benchmark: Evaluating Agentic Search with Agent-as-a-Judge ☆98 · Updated last month
- The code for paper: Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search ☆63 · Updated 7 months ago
- SE-Agent is a self-evolution framework for LLM Code agents. It enables trajectory-level evolution to exchange information across reasonin… ☆227 · Updated 4 months ago
- Computer Agent Arena: Test & compare AI agents in real desktop apps & web environments. Code/data coming soon! ☆52 · Updated 9 months ago
- DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents ☆564 · Updated 2 months ago
- SkillsBench evaluates how well skills work and how effective agents are at using them ☆278 · Updated last week
- Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory ☆246 · Updated 8 months ago
- Prompt-to-Leaderboard ☆271 · Updated 8 months ago
- Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving ☆415 · Updated 5 months ago
- Omni Model Benchmark with high quality and diversity, which reveals the Compositional Law. We’re now focused on Chinese scenarios — and a… ☆74 · Updated 3 weeks ago
- DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL ☆241 · Updated 4 months ago