SalesforceAIResearch/MCPEval

MCP-based Agent Deep Evaluation System

☆155

Alternatives and similar repositories for MCPEval

Users that are interested in MCPEval are comparing it to the libraries listed below.

MrBananaHuman / PangyoCorpora
☆38 · Oct 4, 2023 · Updated 2 years ago
wandb / llm-kr-eval
☆20 · Jul 24, 2024 · Updated 2 years ago
icip-cas / LiveMCPBench
LiveMCPBench is a benchmark for evaluating the ability of agents to navigate and utilize a large-scale MCP toolset. It provides a compreh…
☆104 · Dec 18, 2025 · Updated 7 months ago
StanfordMIMI / dspy-helm
Structured Prompts Improve Evaluation of Language Models
☆15 · Jun 5, 2026 · Updated last month
SalesforceAIResearch / MCP-Universe
MCP-Universe is a comprehensive framework designed for RL training, benchmarking, and developing AI agents for general tool-use.
☆592 · Jun 23, 2026 · Updated last month
Job-Bench / job-bench-eval
Official eval scripts for JobBench
☆32 · Jul 18, 2026 · Updated last week
salesforce / SRMA
Contrastive Learning with Model Augmentation
☆18 · Jun 2, 2026 · Updated last month
allenai / fluid-benchmarking
Fluid Language Model Benchmarking
☆29 · Sep 16, 2025 · Updated 10 months ago
YongWookHa / kor-text-preprocess
Korean text data preprocess toolkit for NLP
☆18 · Jun 11, 2019 · Updated 7 years ago
instructkr / LogicKor
Multi-domain reasoning benchmark for Korean language models
☆209 · Oct 17, 2024 · Updated last year
quantumaikr / KoreanLM
Open-source Korean language model
☆83 · May 4, 2023 · Updated 3 years ago
SalesforceAIResearch / UserRL
The raw UserRL repo under construction
☆114 · Jun 2, 2026 · Updated last month
mcp-tool-bench / MCPToolBenchPP
MCPToolBench++: MCP (Model Context Protocol) Tool Use Benchmark on AI Agent and Model Tool Use Ability
☆44 · Mar 17, 2026 · Updated 4 months ago
suhan1433 / LLM-as-a-judge-using-G-eval
LLM-as-a-judge using G-eval Scratch
☆15 · Oct 12, 2025 · Updated 9 months ago
Marker-Inc-Korea / AutoRAG-example-korean-embedding-benchmark
AutoRAG example about benchmarking Korean embeddings.
☆46 · Oct 2, 2024 · Updated last year
openkorpos / model-mecab
MeCab model trained with OpenKorPos.
☆23 · Jun 19, 2022 · Updated 4 years ago
DLYuanGod / EfficientLLM
☆23 · May 21, 2025 · Updated last year
daekeun-ml / evaluate-llm-on-korean-dataset
Performs benchmarking on two Korean datasets with minimal time and effort.
☆45 · Jan 22, 2026 · Updated 6 months ago
VisualSphinx / VisualSphinx
☆17 · Jun 3, 2025 · Updated last year
Atipico1 / Kor-IR
Kor-IR: Korean Information Retrieval Benchmark
☆87 · Jul 3, 2024 · Updated 2 years ago
hkust-nlp / RL-Verifier-Robustness
From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning.
☆24 · Oct 7, 2025 · Updated 9 months ago
Accenture / mcp-bench
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
☆496 · Oct 7, 2025 · Updated 9 months ago
HAE-RAE / HAE-RAE-BENCH
Benchmark in Korean Context
☆139 · Sep 26, 2023 · Updated 2 years ago
human-rights-corpus / HRC
Human rights corpus (#인권코퍼스)
☆31 · Oct 6, 2023 · Updated 2 years ago
tunib-ai / joker
AI model designed to test the effectiveness in handling external ethical attacks.
☆11 · Feb 9, 2026 · Updated 5 months ago
overfit-brothers / KRX-2024
☆12 · Dec 20, 2024 · Updated last year
JIA-Lab-research / Scaf-GRPO
Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
☆22 · Feb 8, 2026 · Updated 5 months ago
sheep333c / DIVE
DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use
☆28 · Mar 13, 2026 · Updated 4 months ago
kyopark2014 / mcp
Shows how to use the Model Context Protocol.
☆40 · Updated this week
deep-diver / hllama
hllama is a library which aims to provide a set of utility tools for large language models.
☆10 · Apr 16, 2024 · Updated 2 years ago
sionic-ai / Data_KoSuperNI
Translation of the StrategyQA dataset
☆22 · Apr 12, 2024 · Updated 2 years ago
stellalisy / alfa
Repository for the paper: Aligning LLMs to Ask Good Questions: A Case Study in Clinical Reasoning
☆18 · Feb 21, 2025 · Updated last year
xjzzzzzzzz / MCPSafety
☆22 · Dec 18, 2025 · Updated 7 months ago
IBM / NESTFUL
Companion code to https://arxiv.org/abs/2409.03797v2
☆19 · Sep 18, 2025 · Updated 10 months ago
rungalileo / agent-leaderboard
Ranking LLMs on agentic tasks
☆225 · May 21, 2026 · Updated 2 months ago
mclenhard / mcp-evals
A Node.js package and GitHub Action for evaluating MCP (Model Context Protocol) tool implementations using LLM-based scoring. This helps …
☆132 · Jun 23, 2025 · Updated last year
choe-hyonsu-gabrielle / korean-amr-corpus
Korean Abstract Meaning Representation (AMR) Corpus
☆10 · Feb 27, 2022 · Updated 4 years ago
chentong0 / rl-binary-rar
Official repo for "Binary Retrieval-augmented Reward Mitigates Hallucinations"
☆15 · Nov 13, 2025 · Updated 8 months ago
modelcontextprotocol / transports-wg
Transports Working Group
☆16 · Updated this week