eval-sys / mcpmark
MCPMark is a comprehensive, stress-testing MCP benchmark designed to evaluate model and agent capabilities in real-world MCP use.
☆382Updated last week
Alternatives and similar repositories for mcpmark
Users who are interested in mcpmark are comparing it to the libraries listed below.
- Official repo of Toucan: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments☆223Updated last month
- ☆239Updated this week
- Deep Research☆303Updated 5 months ago
- LLM-in-Sandbox Elicits General Agentic Intelligence☆167Updated last week
- ☆144Updated 9 months ago
- MrlX: A Multi-Agent Reinforcement Learning Framework☆189Updated 2 weeks ago
- An End-to-End Infrastructure for Training and Evaluating Various LLM Agents☆708Updated this week
- ☆131Updated last month
- Prompt-to-Leaderboard☆271Updated 8 months ago
- Omni Model Benchmark with high quality and diversity, which reveals the Compositional Law. We’re now focused on Chinese scenarios — and a…☆74Updated 3 weeks ago
- The evaluation benchmark on MCP servers☆238Updated 5 months ago
- ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization☆95Updated 8 months ago
- DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL☆241Updated 4 months ago
- Implementation for OAgents: An Empirical Study of Building Effective Agents☆306Updated 3 months ago
- Benchmark and research code for the paper SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks☆260Updated 9 months ago
- ☆165Updated last month
- Data Synthesis for Deep Research Based on Semi-Structured Data☆197Updated last month
- The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution☆217Updated this week
- Repo of ACL 2025 Paper "Quantification of Large Language Model Distillation"☆93Updated 6 months ago
- AWM: Agent Workflow Memory☆389Updated last month
- WideSearch: Benchmarking Agentic Broad Info-Seeking☆118Updated 3 months ago
- IKEA: Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent☆68Updated 8 months ago
- SkillWeaver is a framework to enable web agent self-improvement through environment exploration and skill synthesis.☆108Updated 9 months ago
- SSRL: Self-Search Reinforcement Learning☆206Updated 5 months ago
- SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning☆94Updated 2 months ago
- ☆192Updated 3 months ago
- [NeurIPS 2025 D&B] 🚀 SWE-bench Goes Live!☆161Updated last week
- [NeurIPS 2025 Spotlight] Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning☆149Updated 4 months ago
- A minimalist MVP demonstrating a simple yet profound insight: aligning AI memory with human episodic memory granularity. Shows how this s…☆161Updated last month
- [EMNLP 2025] RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions☆136Updated 9 months ago