SalesforceAIResearch / MCPEval
MCP-based Agent Deep Evaluation System
☆135 Updated last month
Alternatives and similar repositories for MCPEval
Users who are interested in MCPEval are comparing it to the libraries listed below.
- The official implementation of the paper "Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models". ☆83 Updated 7 months ago
- Official Repo for CRMArena and CRMArena-Pro ☆119 Updated 4 months ago
- Matrix (Multi-Agent daTa geneRation Infra and eXperimentation framework) is a versatile engine for multi-agent conversational data genera… ☆99 Updated this week
- Jina VDR is a multilingual, multi-domain benchmark for visual document retrieval ☆30 Updated 2 months ago
- Source code of "How to Correctly do Semantic Backpropagation on Language-based Agentic Systems" 🤖 ☆76 Updated 10 months ago
- The code repository of the paper: Competition and Attraction Improve Model Fusion ☆161 Updated 2 months ago
- A method for steering LLMs to better follow instructions ☆55 Updated 2 months ago
- Source code for the collaborative reasoner research project at Meta FAIR. ☆103 Updated 6 months ago
- Accompanying material for the sleep-time compute paper ☆117 Updated 6 months ago
- ☆50 Updated last year
- [EMNLP 2025] The official implementation for paper "Agentic-R1: Distilled Dual-Strategy Reasoning" ☆101 Updated 2 months ago
- ScreenSuite - The most comprehensive benchmarking suite for GUI Agents! ☆130 Updated last month
- Code for the paper "Coding Agents with Multimodal Browsing are Generalist Problem Solvers" ☆88 Updated this week
- [NAACL2025] LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications ☆127 Updated 3 months ago
- Implementation of the paper: "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?" ☆63 Updated 10 months ago
- Official Repo for The Paper "Talk Structurally, Act Hierarchically: A Collaborative Framework for LLM Multi-Agent Systems" ☆57 Updated 8 months ago
- ☆101 Updated last year
- ☆232 Updated 3 months ago
- ☆40 Updated 10 months ago
- ☆79 Updated last month
- A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings. ☆169 Updated this week
- [ACL 2025] Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems ☆108 Updated 4 months ago
- ☆48 Updated last year
- Systematic evaluation framework that automatically rates overthinking behavior in large language models. ☆93 Updated 5 months ago
- ☆79 Updated 9 months ago
- Code for our paper PAPILLON: PrivAcy Preservation from Internet-based and Local Language MOdel ENsembles ☆59 Updated 5 months ago
- Leveraging Base Language Models for Few-Shot Synthetic Data Generation ☆36 Updated last week
- Diffbot LLM Inference Server ☆201 Updated 2 months ago
- [ACL 2024] Do Large Language Models Latently Perform Multi-Hop Reasoning? ☆79 Updated 7 months ago
- Train your own SOTA deductive reasoning model ☆109 Updated 7 months ago