MTU-Bench-Team/MTU-Bench

MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models

☆60

Alternatives and similar repositories for MTU-Bench

Users interested in MTU-Bench are comparing it to the repositories listed below.

ShopAgent-Team / ShopSimulator
☆17 · Jan 27, 2026 · Updated 6 months ago
multimodal-art-projection / CodeCriticBench
☆16 · Nov 1, 2025 · Updated 8 months ago
MDI-Benchmark / MDI-Benchmark
☆14 · Dec 18, 2024 · Updated last year
RUC-NLPIR / ET-Agent
☆20 · Jan 18, 2026 · Updated 6 months ago
SimpleVQA / SimpleVQA
SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models
☆15 · Feb 20, 2025 · Updated last year
SWE-Gym / SWE-Bench-Fork
☆13 · Mar 5, 2025 · Updated last year
OpenMOSS / VehicleWorld
VehicleWorld is the first comprehensive multi-device environment for intelligent vehicle interaction that accurately models the complex, …
☆24 · Sep 16, 2025 · Updated 10 months ago
RUC-NLPIR / EnvScaler
The official implementation of "EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis".
☆179 · Feb 12, 2026 · Updated 5 months ago
MCEVAL / McEval
☆48 · Dec 12, 2024 · Updated last year
NEUIR / COAST
Official repository for the paper "COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis".
☆18 · Feb 19, 2025 · Updated last year
junfeng0288 / MathReal
☆15 · Aug 11, 2025 · Updated 11 months ago
multimodal-art-projection / KORGym
☆60 · May 21, 2025 · Updated last year
kwaipilot / SWE-Compass
☆18 · Mar 28, 2026 · Updated 4 months ago
Qwen-Applications / SSP
Search Self-Play: Pushing the Frontier of Agent Capability without Supervision
☆20 · Dec 30, 2025 · Updated 6 months ago
RUC-NLPIR / ClawTrojan
From Prompt Injection to Persistent Control: Defending Agentic Workspaces Against Trojan Backdoors
☆18 · Jun 1, 2026 · Updated last month
WadeYin9712 / UI-Simulator
Code for 🌍 UI-Simulator: LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training
☆21 · Oct 17, 2025 · Updated 9 months ago
llm-in-sandbox / llm-in-sandbox
Computer Environments Elicit General Agentic Intelligence in LLMs
☆239 · Jul 21, 2026 · Updated last week
yym6472 / bert_slot_tagging
A sequence labeling model implemented with pretrained BERT.
☆14 · Sep 29, 2020 · Updated 5 years ago
THUNLP-MT / StableToolBench
A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench.
☆238 · Apr 15, 2025 · Updated last year
Quehry / HelloBench
HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models
☆60 · Nov 26, 2024 · Updated last year
Junjie-Ye / ToolEyes
[COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
☆74 · May 13, 2025 · Updated last year
WillDreamer / ARL-Arena
[ICML2026] ARLArena
☆90 · May 2, 2026 · Updated 2 months ago
ChestnutWYN / ACL2021-Novel-Slot-Detection
☆17 · Jul 9, 2021 · Updated 5 years ago
microsoft / prose-benchmarks
PROSE Public Benchmark Suite
☆35 · Sep 15, 2025 · Updated 10 months ago
euReKa025 / AgentLongBench
☆22 · Jan 29, 2026 · Updated 6 months ago
yxzwang / FamilyTool
FamilyTool benchmark
☆14 · Sep 10, 2025 · Updated 10 months ago
ayaabdelsalam91 / saliency_guided_training
☆13 · Nov 29, 2021 · Updated 4 years ago
YuyaoZhangQAQ / QCompiler
This repository contains the code for the paper “Neuro-Symbolic Query Compiler”, accepted to the Findings of ACL 2025.
☆17 · Oct 20, 2025 · Updated 9 months ago
bytedance / FullStackBench
Official repository for our paper "FullStack Bench: Evaluating LLMs as Full Stack Coders"
☆122 · May 7, 2025 · Updated last year
xinghaow99 / prism
[ICML 2026] Prism: Spectral-Aware Block-Sparse Attention
☆27 · May 22, 2026 · Updated 2 months ago
quchangle1 / LLM-Tool-Survey
This is the repository for the Tool Learning survey.
☆486 · Aug 9, 2025 · Updated 11 months ago
UKPLab / arxiv2025-inherent-limits-plms
Code repository for the paper "The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Le…
☆14 · Jan 16, 2025 · Updated last year
Hambaobao / SWE-Flow
SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner
☆40 · Jun 29, 2025 · Updated last year
conceptmath / conceptmath
[ACL 2024 Findings] The official repo for "ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large …
☆26 · May 29, 2024 · Updated 2 years ago
ignorejjj / LongRefiner
The code for paper: Hierarchical Document Refinement for Long-context Retrieval-augmented Generation [ACL2025 Oral]
☆47 · Aug 25, 2025 · Updated 11 months ago
MindLab-Research / longstraw
MinT-2M: Long-context training system for resident-prefix GRPO
☆39 · Updated this week
Arvid-pku / ATOKE
[AAAI 2024] History Matters: Temporal Knowledge Editing in Large Language Model
☆13 · Dec 17, 2023 · Updated 2 years ago
sophgo / sophon-pipeline
☆44 · Jul 5, 2024 · Updated 2 years ago
myt517 / DKT
Official implementation of "Disentangled Knowledge Transfer for OOD Intent Discovery with Unified Contrastive Learning", ACL2022 main con…
☆14 · Jul 23, 2022 · Updated 4 years ago