icip-cas/LiveMCPBench

LiveMCPBench is a benchmark for evaluating the ability of agents to navigate and utilize a large-scale MCP toolset. It provides a comprehensive set of tasks that challenge agents to effectively use various tools in daily scenarios.

☆93

Alternatives and similar repositories for LiveMCPBench

Users who are interested in LiveMCPBench are comparing it to the repositories listed below.

panruotong / CAG
Implementation of Not All Contexts Are Equal: Teaching LLMs Credibility-aware Generation. Paper: https://arxiv.org/abs/2404.06809
☆22 · Oct 22, 2024 · Updated last year
snumprlab / hima
Official Implementation of HIMA (COLM'25)
☆19 · Nov 25, 2025 · Updated 3 months ago
Longin-Yu / ComRoPE
☆12 · Jun 11, 2025 · Updated 8 months ago
PacktPublishing / DeepSeek-in-Practice
DeepSeek Essentials, published by Packt
☆29 · Jan 27, 2026 · Updated last month
gxy-gxy / DeepRAG
DeepRAG: Thinking to Retrieve Step by Step for Large Language Models
☆33 · Feb 17, 2026 · Updated 2 weeks ago
WenyiWU0111 / CoMEM
This is the official code repository for the paper: Towards General Continuous Memory for Vision-Language Models.
☆21 · Jul 3, 2025 · Updated 8 months ago
paul-rottger / msts-multimodal-safety
Röttger et al. (2025): "MSTS: A Multimodal Safety Test Suite for Vision-Language Models"
☆16 · Mar 31, 2025 · Updated 11 months ago
casedone / rag-multimodal
☆40 · Aug 4, 2025 · Updated 7 months ago
shuzhangzhong / HybriMoE-Preview
☆17 · Apr 9, 2025 · Updated 10 months ago
LG-AI-EXAONE / K-EXAONE
Official repository for K-EXAONE built by LG AI Research
☆69 · Feb 6, 2026 · Updated last month
LaVi-Lab / LongContextReasoner
[ACL 2024] Making Long-Context Language Models Better Multi-Hop Reasoners
☆19 · May 28, 2024 · Updated last year
linhaowei1 / kumo
☁️ KUMO: Generative Evaluation of Complex Reasoning in Large Language Models
☆19 · Jun 4, 2025 · Updated 9 months ago
ttw1018 / MoPE-DST
The code for "MoPE: Mixture of Prefix Experts for Zero-Shot Dialogue State Tracking"
☆19 · Jan 25, 2025 · Updated last year
ShenzheZhu / JailDAM
[COLM 2025] JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model
☆25 · Nov 25, 2025 · Updated 3 months ago
tianyi-lab / C3PO
[COLM 2025] "C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing"
☆20 · Apr 9, 2025 · Updated 10 months ago
PacktPublishing / Harnessing-Ollama---Create-Secure-Local-LLM-Solutions-with-Python
Harnessing Ollama - Create Secure Local LLM Solutions with Python, Published by Packt Publishing
☆22 · Dec 19, 2024 · Updated last year
BryceZhuo / PolyCom
The official implementation of ICLR 2025 paper "Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models".
☆18 · Apr 25, 2025 · Updated 10 months ago
tongxuluo / LeaP
Code, Data and Model for Paper "Learning from Peers in Reasoning Models"
☆27 · May 13, 2025 · Updated 9 months ago
corca-ai / evaluating-gpt-4o-on-CLIcK
Evaluate gpt-4o on CLIcK (Korean NLP Dataset)
☆20 · May 18, 2024 · Updated last year
thunlp / SparsingLaw
The open-source materials for paper "Sparsing Law: Towards Large Language Models with Greater Activation Sparsity".
☆30 · Nov 12, 2024 · Updated last year
icip-cas / SSO
A scalable automated alignment method for large language models. Resources for "Aligning Large Language Models via Self-Steering Optimiza…
☆20 · Nov 21, 2024 · Updated last year
RUCKBReasoning / CodeRM
Official code implementation for the ACL 2025 paper: 'Dynamic Scaling of Unit Tests for Code Reward Modeling'
☆27 · May 16, 2025 · Updated 9 months ago
regent-research / regent
☆27 · Jan 22, 2025 · Updated last year
pixeli99 / MixLN
[ICLR 2025] Official Pytorch Implementation of "Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN" by Pengxia…
☆29 · Jul 24, 2025 · Updated 7 months ago
Siddhartha80 / AI-Powered-Predictive-Maintenance-System-for-Vehicles-with-Real-Time-Data-Visualization-and-Analysis
Gradient Boosting Models on Real-Time Sensor Data for AI-Enhanced Vehicle Predictive Maintenance. By using a web-based interface to forec…
☆19 · Nov 17, 2024 · Updated last year
Quehry / HelloBench
HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models
☆53 · Nov 26, 2024 · Updated last year
technion-cs-nlp / hallucination-mitigation
☆23 · Dec 17, 2024 · Updated last year
DynaMath / DynaMath
A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
☆28 · Nov 25, 2024 · Updated last year
sionic-ai / Data_KoSuperNI
Translation of the StrategyQA dataset
☆23 · Apr 12, 2024 · Updated last year
activatedgeek / calibration-tuning
☆53 · Apr 9, 2025 · Updated 10 months ago
mcp-tool-bench / MCPToolBenchPP
MCPToolBench++ MCP Model Context Protocol Tool Use Benchmark on AI Agent and Model Tool Use Ability
☆41 · Dec 17, 2025 · Updated 2 months ago
Tencent-Hunyuan / Hunyuan-0.5B
☆53 · Aug 5, 2025 · Updated 7 months ago
ibm-granite / granite-guardian
The Granite Guardian models are designed to detect risks in prompts and responses.
☆135 · Oct 8, 2025 · Updated 4 months ago
allenai / noncompliance
This repository contains data, code and models for contextual noncompliance.
☆25 · Jul 18, 2024 · Updated last year
Yarayx / livelongbench
The first spoken long-text dataset derived from live streams, designed to reflect the redundancy-rich and conversational nature of real-w…
☆12 · Jun 28, 2025 · Updated 8 months ago
J-Seo / KoCommonGEN-V2
KoCommonGEN v2: A Benchmark for Navigating Korean Commonsense Reasoning Challenges in Large Language Models
☆25 · Aug 24, 2024 · Updated last year
efficientscaling / Z1
[EMNLP'25 Industry] Repo for "Z1: Efficient Test-time Scaling with Code"
☆68 · Apr 11, 2025 · Updated 10 months ago
jdh-algo / JoyTTS
☆40 · Jul 15, 2025 · Updated 7 months ago
LLMSQL / llmsql-benchmark
A Text2SQL benchmark for evaluation of Large Language Models
☆41 · Updated this week