LiveMCPBench is a benchmark for evaluating the ability of agents to navigate and utilize a large-scale MCP toolset. It provides a comprehensive set of tasks that challenge agents to effectively use various tools in daily scenarios.
☆100 · Dec 18, 2025 · Updated 5 months ago
Alternatives and similar repositories for LiveMCPBench
Users that are interested in LiveMCPBench are comparing it to the libraries listed below.
- Implementation of Not All Contexts Are Equal: Teaching LLMs Credibility-aware Generation. Paper: https://arxiv.org/abs/2404.06809 ☆22 · Oct 22, 2024 · Updated last year
- ☆12 · Jun 13, 2025 · Updated 11 months ago
- This is the official code repository for the paper: Towards General Continuous Memory for Vision-Language Models. ☆27 · Jul 3, 2025 · Updated 10 months ago
- ☆11 · Jun 11, 2025 · Updated 11 months ago
- [ACL 2024] Making Long-Context Language Models Better Multi-Hop Reasoners ☆20 · May 28, 2024 · Updated 2 years ago
- llms related stuff, including code, docs ☆13 · Feb 25, 2025 · Updated last year
- MCPToolBench++ MCP Model Context Protocol Tool Use Benchmark on AI Agent and Model Tool Use Ability ☆44 · Mar 17, 2026 · Updated 2 months ago
- A scalable automated alignment method for large language models. Resources for "Aligning Large Language Models via Self-Steering Optimiza… ☆20 · Nov 21, 2024 · Updated last year
- Code base for "Target-Side Augmentation for Document-Level Machine Translation" ☆15 · Aug 15, 2023 · Updated 2 years ago
- Repository of paper "How Likely Do LLMs with CoT Mimic Human Reasoning?" ☆23 · Feb 19, 2025 · Updated last year
- A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models ☆30 · Nov 25, 2024 · Updated last year
- [COLM 2025] JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model ☆28 · Nov 25, 2025 · Updated 6 months ago
- The official implementation of ICLR 2025 paper "Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models". ☆18 · Apr 25, 2025 · Updated last year
- The open-source materials for paper "Sparsing Law: Towards Large Language Models with Greater Activation Sparsity". ☆30 · Nov 12, 2024 · Updated last year
- [ICLR 2025] Official Pytorch Implementation of "Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN" by Pengxia… ☆30 · Jul 24, 2025 · Updated 10 months ago
- Official code of "RoboOmni: Proactive Robot Manipulation in Omni-modal Context" ☆108 · Mar 28, 2026 · Updated 2 months ago
- Code, Data and Model for Paper "Learning from Peers in Reasoning Models" ☆27 · May 13, 2025 · Updated last year
- [COLM 2025: 1st Workshop on the Application of LLM Explainability to Reasoning and Planning] Latent Chain-of-Thought? Decoding the Depth-… ☆18 · Oct 4, 2025 · Updated 7 months ago
- This is for EMNLP 2024 Paper: AppBench: Planning of Multiple APIs from Various APPs for Complex User Instruction ☆16 · Nov 4, 2024 · Updated last year
- ☆45 · Jun 19, 2025 · Updated 11 months ago
- P1: Mastering Physics Olympiads with Reinforcement Learning ☆85 · Dec 29, 2025 · Updated 5 months ago
- For <Does It Make Sense? And Why? A Pilot Study for Sense Making and Explanation>. Accepted by ACL2019 ☆26 · Oct 23, 2020 · Updated 5 years ago
- ☆40 · Jul 15, 2025 · Updated 10 months ago
- ☆17 · Apr 9, 2025 · Updated last year
- RACE is a multi-dimensional benchmark for code generation that focuses on Readability, mAintainability, Correctness, and Efficiency. ☆14 · Oct 12, 2024 · Updated last year
- HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models ☆57 · Nov 26, 2024 · Updated last year
- Javascript wrapper bindings for diamond types ☆13 · Sep 13, 2021 · Updated 4 years ago
- DOMAINEVAL is an auto-constructed benchmark for multi-domain code generation that consists of 2k+ subjects (i.e., description, reference … ☆13 · Dec 12, 2024 · Updated last year
- This repository contains data and code used for On the Risk of Misinformation Pollution with Large Language Models (EMNLP 2023 Findings). ☆17 · Dec 14, 2023 · Updated 2 years ago
- awesome nlp resource ☆64 · May 19, 2021 · Updated 5 years ago
- ☆13 · Aug 23, 2017 · Updated 8 years ago
- Repository for DISRPT2019 shared task ☆12 · Sep 5, 2022 · Updated 3 years ago
- Sotopia-RL: Reward Design for Social Intelligence ☆50 · Apr 1, 2026 · Updated last month
- [ICML'24] TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks ☆33 · Sep 20, 2024 · Updated last year
- ☆20 · May 23, 2025 · Updated last year
- [arxiv: 2512.19673] Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies ☆60 · Feb 6, 2026 · Updated 3 months ago
- Code and data for "Improving Temporal Generalization of Pre-trained Language Models with Lexical Semantic Change" (EMNLP2022) ☆18 · Dec 8, 2022 · Updated 3 years ago
- code for "Fine-grained Entity Typing via Label Reasoning" EMNLP2021 ☆13 · May 27, 2022 · Updated 4 years ago
- ☆15 · Feb 26, 2025 · Updated last year