MTU-Bench-Team / MTU-Bench
MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models
☆42 Updated 2 months ago
Alternatives and similar repositories for MTU-Bench:
Users who are interested in MTU-Bench are comparing it to the repositories listed below.
- ☆47 Updated 4 months ago
- Official codebase for "GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning". ☆72 Updated 2 weeks ago
- ☆102 Updated 5 months ago
- ☆132 Updated 2 weeks ago
- The demo, code and data of FollowRAG ☆72 Updated 2 weeks ago
- Knowledge-Reasoning Synergy Reinforcement Learning. ☆35 Updated 2 months ago
- ☆55 Updated 6 months ago
- SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis ☆41 Updated 2 weeks ago
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling ☆101 Updated 3 months ago
- [ICLR 2025] Benchmarking Agentic Workflow Generation ☆89 Updated 2 months ago
- MPO: Boosting LLM Agents with Meta Plan Optimization ☆51 Updated 2 months ago
- ☆42 Updated 2 months ago
- xVerify: Efficient Answer Verifier for Reasoning Model Evaluations ☆94 Updated 3 weeks ago
- Reformatted Alignment ☆115 Updated 7 months ago
- [COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios ☆65 Updated 5 months ago
- Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems ☆90 Updated 2 months ago
- ☆153 Updated last month
- ☆151 Updated 4 months ago
- Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference) ☆138 Updated 6 months ago
- Watch Every Step! LLM Agent Learning via Iterative Step-level Process Refinement (EMNLP 2024 Main Conference) ☆57 Updated 6 months ago
- Hammer: Robust Function-Calling for On-Device Language Models via Function Masking ☆76 Updated 2 months ago
- The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks" ☆53 Updated last year
- We aim to provide the best references to search, select, and synthesize high-quality and large-quantity data for post-training your LLMs. ☆54 Updated 7 months ago
- Critique-out-Loud Reward Models ☆64 Updated 6 months ago
- [preprint] We propose a novel fine-tuning method, Separate Memory and Reasoning, which combines prompt tuning with LoRA. ☆44 Updated 4 months ago
- [ICML 2025] Teaching Language Models to Critique via Reinforcement Learning ☆93 Updated this week
- We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method to unleash the reasoning capability of LLMs. ☆62 Updated 6 months ago
- [EMNLP 2024] Source code for the paper "Learning Planning-based Reasoning with Trajectory Collection and Process Rewards Synthesizing". ☆76 Updated 3 months ago
- Implementation for the research paper "Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision". ☆52 Updated 5 months ago
- ☆49 Updated last year