eval-sys / mcpmark
MCP Servers are shaping the future of software. MCPMark is a comprehensive, stress-testing benchmark and a collection of diverse, verifiable tasks designed to evaluate model capabilities in real-world MCP use.
☆89Updated this week
Alternatives and similar repositories for mcpmark
Users who are interested in mcpmark compare it to the repositories listed below.
- A lightweight script for converting HTML pages to Markdown, with support for code blocks☆79Updated last year
- Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving☆341Updated 2 weeks ago
- Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors (ACL Findings 2025)☆83Updated 3 months ago
- SkillWeaver is a framework to enable web agent self-improvement through environment exploration and skill synthesis.☆90Updated 4 months ago
- 🍎APPL: A Prompt Programming Language. Seamlessly integrate LLMs with programs.☆259Updated 6 months ago
- Repo for "MaskSearch: A Universal Pre-Training Framework to Enhance Agentic Search Capability"☆140Updated 3 months ago
- Computer Agent Arena: Test & compare AI agents in real desktop apps & web environments. Code/data coming soon!☆50Updated 4 months ago
- Inference code of Lingma SWE-GPT☆239Updated 9 months ago
- Using APPL to reimplement popular algorithms for Large Language Models (LLMs) and prompts☆45Updated 7 months ago
- ☆293Updated 3 months ago
- ☆136Updated last week
- LLM Reasoning and Generation Benchmark. Evaluate LLMs in complex scenarios systematically.☆164Updated 3 months ago
- Deep Reasoning Translation (DRT) Project☆228Updated this week
- Official implementation for "ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization"☆82Updated 3 months ago
- Abacus Code LLM (珠算代码大模型)☆56Updated 11 months ago
- IKEA: Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent☆64Updated 3 months ago
- [ICML 2025] ResearchTown: Simulator of Human Research Community☆172Updated this week
- MiroThinker: open-source agentic models trained for deep research and complex tool-use scenarios.☆274Updated this week
- Efficient Agent Training for Computer Use☆129Updated 2 months ago
- A minimalist MVP demonstrating a simple yet profound insight: aligning AI memory with human episodic memory granularity. Shows how this s…☆68Updated 3 weeks ago
- Enable tool-use ability for any LLM model (DeepSeek V3/R1, etc.)☆53Updated 3 months ago
- TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles.☆155Updated 10 months ago
- Chrome / Edge extension to turn arXiv papers into Markdown codes in one click.☆81Updated 5 months ago
- [COLM 2025] An Open Math Pre-training Dataset with 370B Tokens.☆100Updated 4 months ago
- [ACL 2025] AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant☆38Updated 8 months ago
- An open platform for enhancing the capability of LLMs in workflow orchestration.☆168Updated 5 months ago
- ☆103Updated 8 months ago
- [ACL'25 Findings] SWE-Dev is an SWE agent with a scalable test case construction pipeline.☆53Updated last month
- The code for paper: Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search☆54Updated last month
- Challenges for general-purpose web-browsing AI agents☆64Updated 3 months ago