THUNLP-MT / StableToolBench
A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench.
☆125 Updated 4 months ago
Alternatives and similar repositories for StableToolBench:
Users who are interested in StableToolBench are comparing it to the repositories listed below.
- ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models ☆171 Updated 3 months ago
- A series of technical reports on Slow Thinking with LLM ☆297 Updated last week
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨ ☆145 Updated 8 months ago
- ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - NeurIPS 2023 (oral) ☆250 Updated 9 months ago
- This is the repository that contains the source code for the Self-Evaluation Guided MCTS for online DPO. ☆272 Updated 5 months ago
- ☆247 Updated 5 months ago
- [COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios ☆64 Updated last month
- Codes and Data for Scaling Relationship on Learning Mathematical Reasoning with Large Language Models ☆233 Updated 4 months ago
- [ACL2024] Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios ☆50 Updated 9 months ago
- ☆295 Updated last month
- [ACL 2024] AUTOACT: Automatic Agent Learning from Scratch for QA via Self-Planning ☆199 Updated this week
- [EMNLP 2024 (Oral)] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA ☆107 Updated 2 months ago
- ☆119 Updated last month
- Implementation of ICML 23 Paper: Specializing Smaller Language Models towards Multi-Step Reasoning. ☆128 Updated last year
- Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha… ☆110 Updated 7 months ago
- Building a comprehensive and handy list of papers for GUI agents ☆163 Updated last week
- Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference) ☆109 Updated 2 months ago
- The related works and background techniques about OpenAI o1 ☆192 Updated last week
- Implementation for the research paper "Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision". ☆49 Updated last month
- Official Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale" ☆208 Updated 3 months ago
- A large-scale, fine-grained, diverse preference dataset (and models). ☆325 Updated last year
- InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning ☆236 Updated last year
- [EMNLP 2024] The official GitHub repo for the survey paper "Knowledge Conflicts for LLMs: A Survey" ☆98 Updated 3 months ago
- [ICLR 2024] MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use ☆76 Updated 9 months ago
- ☆110 Updated 3 weeks ago
- Curation of resources for LLM mathematical reasoning, most of which are screened by @tongyx361 to ensure high quality and accompanied wit… ☆102 Updated 6 months ago
- Awesome LLM Self-Consistency: a curated list of Self-consistency in Large Language Models ☆85 Updated 5 months ago
- [NeurIPS 2024] Agent Planning with World Knowledge Model ☆98 Updated last month
- Repo of paper "Free Process Rewards without Process Labels" ☆94 Updated this week
- Data and Code for Program of Thoughts (TMLR 2023) ☆256 Updated 8 months ago