eval-sys / mcpmark
MCPMark is a comprehensive, stress-testing MCP benchmark designed to evaluate model and agent capabilities in real-world MCP use.
☆317 · Updated last week
Alternatives and similar repositories for mcpmark
Users interested in mcpmark are comparing it to the repositories listed below.
- Official repo of Toucan: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments ☆169 · Updated last month
- Deep Research ☆303 · Updated 2 months ago
- Draft-Target Disaggregation LLM Serving System via Parallel Speculative Decoding ☆117 · Updated last week
- [EMNLP 2025] RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions ☆136 · Updated 7 months ago
- AI benchmark runtime framework that lets you integrate and evaluate AI tasks using Docker-based benchmarks ☆164 · Updated 6 months ago
- The evaluation benchmark on MCP servers ☆225 · Updated 2 months ago
- DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL ☆204 · Updated last month
- MrlX: A Multi-Agent Reinforcement Learning Framework ☆129 · Updated 2 weeks ago
- DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents ☆476 · Updated this week
- [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents ☆463 · Updated this week
- Benchmark and research code for the paper SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks ☆250 · Updated 6 months ago
- Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving ☆387 · Updated 3 months ago
- An open platform for enhancing the capability of LLMs in workflow orchestration ☆178 · Updated 8 months ago
- A minimalist MVP demonstrating a simple yet profound insight: aligning AI memory with human episodic memory granularity. Shows how this s… ☆125 · Updated 2 weeks ago
- ☆301 · Updated 5 months ago
- Implementation for OAgents: An Empirical Study of Building Effective Agents ☆282 · Updated last month
- [DAI 2025] Beyond GPT-5: Making LLMs Cheaper and Better via Performance–Efficiency Optimized Routing ☆178 · Updated 2 weeks ago
- The All-in-one Judge Models introduced by OpenCompass ☆114 · Updated 4 months ago
- [ICML 2025] ResearchTown: Simulator of Human Research Community ☆183 · Updated this week
- ☆293 · Updated 4 months ago
- Data Synthesis for Deep Research Based on Semi-Structured Data ☆179 · Updated last week
- A high-performance inference engine for LLMs, optimized for diverse AI accelerators ☆707 · Updated this week
- Omni Model Benchmark with high quality and diversity, which reveals the Compositional Law. We’re now focused on Chinese scenarios, and a… ☆72 · Updated 2 weeks ago
- [NeurIPS 2025 D&B] 🚀 SWE-bench Goes Live! ☆135 · Updated this week
- ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization ☆90 · Updated 6 months ago
- ☆45 · Updated last month
- Benchmarking Chat Assistants on Long-Term Interactive Memory (ICLR 2025) ☆278 · Updated 3 weeks ago
- SE-Agent is a self-evolution framework for LLM code agents. It enables trajectory-level evolution to exchange information across reasonin… ☆196 · Updated 2 months ago
- Computer Agent Arena: Test and compare AI agents in real desktop apps and web environments. Code/data coming soon! ☆50 · Updated 7 months ago
- [NeurIPS 2024] Personal Agentic AI for Multi-Agent Cooperation ☆87 · Updated 11 months ago