LiveMCPBench is a benchmark for evaluating the ability of agents to navigate and utilize a large-scale MCP toolset. It provides a comprehensive set of tasks that challenge agents to effectively use various tools in daily scenarios.
☆93Dec 18, 2025Updated 2 months ago
Alternatives and similar repositories for LiveMCPBench
Users that are interested in LiveMCPBench are comparing it to the libraries listed below
Sorting:
- Implementation of Not All Contexts Are Equal: Teaching LLMs Credibility-aware Generation. Paper: https://arxiv.org/abs/2404.06809☆22Oct 22, 2024Updated last year
- Official Implementation of HIMA (COLM'25)☆19Nov 25, 2025Updated 3 months ago
- ☆12Jun 11, 2025Updated 8 months ago
- DeepSeek Essentials, published by Packt☆29Jan 27, 2026Updated last month
- DeepRAG: Thinking to Retrieve Step by Step for Large Language Models☆33Feb 17, 2026Updated 2 weeks ago
- This is the official code repository for the paper: Towards General Continuous Memory for Vision-Language Models.☆21Jul 3, 2025Updated 8 months ago
- Röttger et al. (2025): "MSTS: A Multimodal Safety Test Suite for Vision-Language Models"☆16Mar 31, 2025Updated 11 months ago
- ☆40Aug 4, 2025Updated 7 months ago
- ☆17Apr 9, 2025Updated 10 months ago
- Official repository for K-EXAONE built by LG AI Research☆69Feb 6, 2026Updated last month
- [ACL 2024] Making Long-Context Language Models Better Multi-Hop Reasoners☆19May 28, 2024Updated last year
- ☁️ KUMO: Generative Evaluation of Complex Reasoning in Large Language Models☆19Jun 4, 2025Updated 9 months ago
- The code for "MoPE: Mixture of Prefix Experts for Zero-Shot Dialogue State Tracking"☆19Jan 25, 2025Updated last year
- [COLM 2025] JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model☆25Nov 25, 2025Updated 3 months ago
- [COLM 2025] "C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing"☆20Apr 9, 2025Updated 10 months ago
- Harnessing Ollama - Create Secure Local LLM Solutions with Python, Published by Packt Publishing☆22Dec 19, 2024Updated last year
- The official implementation of ICLR 2025 paper "Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models".☆18Apr 25, 2025Updated 10 months ago
- Code, Data and Model for Paper "Learning from Peers in Reasoning Models"☆27May 13, 2025Updated 9 months ago
- Evaluate gpt-4o on CLIcK (Korean NLP Dataset)☆20May 18, 2024Updated last year
- The open-source materials for paper "Sparsing Law: Towards Large Language Models with Greater Activation Sparsity".☆30Nov 12, 2024Updated last year
- A scalable automated alignment method for large language models. Resources for "Aligning Large Language Models via Self-Steering Optimiza…☆20Nov 21, 2024Updated last year
- Official code implementation for the ACL 2025 paper: 'Dynamic Scaling of Unit Tests for Code Reward Modeling'☆27May 16, 2025Updated 9 months ago
- ☆27Jan 22, 2025Updated last year
- [ICLR 2025] Official Pytorch Implementation of "Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN" by Pengxia…☆29Jul 24, 2025Updated 7 months ago
- Gradient Boosting Models on Real-Time Sensor Data for AI-Enhanced Vehicle Predictive Maintenance. By using a web-based interface to forec…☆19Nov 17, 2024Updated last year
- HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models☆53Nov 26, 2024Updated last year
- ☆23Dec 17, 2024Updated last year
- A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models☆28Nov 25, 2024Updated last year
- StrategyQA 데이터 세트 번역☆23Apr 12, 2024Updated last year
- ☆53Apr 9, 2025Updated 10 months ago
- MCPToolBench++ MCP Model Context Protocol Tool Use Benchmark on AI Agent and Model Tool Use Ability☆41Dec 17, 2025Updated 2 months ago
- ☆53Aug 5, 2025Updated 7 months ago
- The Granite Guardian models are designed to detect risks in prompts and responses.☆135Oct 8, 2025Updated 4 months ago
- This repository contains data, code and models for contextual noncompliance.☆25Jul 18, 2024Updated last year
- The first spoken long-text dataset derived from live streams, designed to reflect the redundancy-rich and conversational nature of real-w…☆12Jun 28, 2025Updated 8 months ago
- KoCommonGEN v2: A Benchmark for Navigating Korean Commonsense Reasoning Challenges in Large Language Models☆25Aug 24, 2024Updated last year
- [EMNLP'25 Industry] Repo for "Z1: Efficient Test-time Scaling with Code"☆68Apr 11, 2025Updated 10 months ago
- ☆40Jul 15, 2025Updated 7 months ago
- A Text2SQL benchmark for evaluation of Large Language Models☆41Updated this week