icip-cas / LiveMCPBench
LiveMCPBench is a benchmark for evaluating the ability of agents to navigate and utilize a large-scale MCP toolset. It provides a comprehensive set of tasks that challenge agents to effectively use various tools in daily scenarios.
☆92 Dec 18, 2025 Updated last month
Alternatives and similar repositories for LiveMCPBench
Users who are interested in LiveMCPBench are comparing it to the repositories listed below:
- Implementation of Not All Contexts Are Equal: Teaching LLMs Credibility-aware Generation. Paper: https://arxiv.org/abs/2404.06809 ☆21 Oct 22, 2024 Updated last year
- ☆12 Jun 11, 2025 Updated 8 months ago
- Official Implementation of HIMA (COLM'25) ☆19 Nov 25, 2025 Updated 2 months ago
- This is the official code repository for the paper: Towards General Continuous Memory for Vision-Language Models. ☆19 Jul 3, 2025 Updated 7 months ago
- DeepRAG: Thinking to Retrieve Step by Step for Large Language Models ☆32 May 17, 2025 Updated 8 months ago
- Röttger et al. (2025): "MSTS: A Multimodal Safety Test Suite for Vision-Language Models" ☆16 Mar 31, 2025 Updated 10 months ago
- ☆39 Aug 4, 2025 Updated 6 months ago
- Official repository for K-EXAONE built by LG AI Research ☆66 Feb 6, 2026 Updated last week
- ☆17 Apr 9, 2025 Updated 10 months ago
- [ACL 2024] Making Long-Context Language Models Better Multi-Hop Reasoners ☆19 May 28, 2024 Updated last year
- The code for "MoPE: Mixture of Prefix Experts for Zero-Shot Dialogue State Tracking" ☆19 Jan 25, 2025 Updated last year
- ☁️ KUMO: Generative Evaluation of Complex Reasoning in Large Language Models ☆19 Jun 4, 2025 Updated 8 months ago
- [COLM 2025] "C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing" ☆20 Apr 9, 2025 Updated 10 months ago
- The official implementation of ICLR 2025 paper "Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models". ☆18 Apr 25, 2025 Updated 9 months ago
- [COLM 2025] JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model ☆25 Nov 25, 2025 Updated 2 months ago
- Code, Data and Model for Paper "Learning from Peers in Reasoning Models" ☆27 May 13, 2025 Updated 9 months ago
- The open-source materials for paper "Sparsing Law: Towards Large Language Models with Greater Activation Sparsity". ☆30 Nov 12, 2024 Updated last year
- A scalable automated alignment method for large language models. Resources for "Aligning Large Language Models via Self-Steering Optimiza… ☆20 Nov 21, 2024 Updated last year
- [ICLR 2025] Official Pytorch Implementation of "Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN" by Pengxia… ☆29 Jul 24, 2025 Updated 6 months ago
- Gradient Boosting Models on Real-Time Sensor Data for AI-Enhanced Vehicle Predictive Maintenance. By using a web-based interface to forec… ☆19 Nov 17, 2024 Updated last year
- ☆27 Jan 22, 2025 Updated last year
- Official code implementation for the ACL 2025 paper: 'Dynamic Scaling of Unit Tests for Code Reward Modeling' ☆27 May 16, 2025 Updated 8 months ago
- HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models ☆53 Nov 26, 2024 Updated last year
- A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models ☆28 Nov 25, 2024 Updated last year
- Translation of the StrategyQA dataset ☆23 Apr 12, 2024 Updated last year
- ☆22 Dec 17, 2024 Updated last year
- ☆53 Apr 9, 2025 Updated 10 months ago
- The Granite Guardian models are designed to detect risks in prompts and responses. ☆130 Oct 8, 2025 Updated 4 months ago
- ☆53 Aug 5, 2025 Updated 6 months ago
- This repository contains data, code and models for contextual noncompliance. ☆25 Jul 18, 2024 Updated last year
- [arxiv: 2512.19673] Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies ☆59 Feb 6, 2026 Updated last week
- KoCommonGEN v2: A Benchmark for Navigating Korean Commonsense Reasoning Challenges in Large Language Models ☆25 Aug 24, 2024 Updated last year
- A Text2SQL benchmark for evaluation of Large Language Models ☆41 Updated this week
- [EMNLP'25 Industry] Repo for "Z1: Efficient Test-time Scaling with Code" ☆68 Apr 11, 2025 Updated 10 months ago
- The first spoken long-text dataset derived from live streams, designed to reflect the redundancy-rich and conversational nature of real-w… ☆13 Jun 28, 2025 Updated 7 months ago
- ☆40 Jul 15, 2025 Updated 6 months ago
- [ICML'24] TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks ☆31 Sep 20, 2024 Updated last year
- [EMNLP 2025 Main] LinkAlign: Scalable Schema Linking for Real-World Large-Scale Multi-Database Text-to-SQL ☆60 Jun 18, 2025 Updated 7 months ago
- Official code of "RoboOmni: Proactive Robot Manipulation in Omni-modal Context" ☆81 Nov 17, 2025 Updated 2 months ago