THUNLP-MT / StableToolBench
A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench.
☆129 Updated 5 months ago
Alternatives and similar repositories for StableToolBench:
Users who are interested in StableToolBench compare it to the repositories listed below:
- ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models ☆175 Updated 4 months ago
- [COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios ☆65 Updated 2 months ago
- This is the repository that contains the source code for the Self-Evaluation Guided MCTS for online DPO. ☆289 Updated 6 months ago
- [EMNLP 2024 (Oral)] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA ☆111 Updated 3 months ago
- Generative Judge for Evaluating Alignment ☆228 Updated last year
- ☆257 Updated 6 months ago
- Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha… ☆112 Updated 8 months ago
- ☆130 Updated 2 months ago
- [ACL 2024] AUTOACT: Automatic Agent Learning from Scratch for QA via Self-Planning ☆206 Updated last month
- [ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues ☆68 Updated 6 months ago
- A large-scale, fine-grained, diverse preference dataset (and models). ☆329 Updated last year
- [ACL2024] Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios ☆52 Updated 10 months ago
- ☆319 Updated 2 weeks ago
- Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference) ☆116 Updated 3 months ago
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨ ☆172 Updated 9 months ago
- A series of technical reports on Slow Thinking with LLM ☆409 Updated last week
- [EMNLP 2024] Source code for the paper "Learning Planning-based Reasoning with Trajectory Collection and Process Rewards Synthesizing". ☆68 Updated last month
- Implementation for the research paper "Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision". ☆52 Updated 2 months ago
- Codes for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718 ☆307 Updated 4 months ago
- InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning ☆240 Updated last year
- An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral] ☆281 Updated 9 months ago
- Collection of papers for scalable automated alignment. ☆82 Updated 3 months ago
- [ACL'24] Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning ☆141 Updated 5 months ago
- 🐋 An unofficial implementation of Self-Alignment with Instruction Backtranslation. ☆136 Updated 7 months ago
- The related works and background techniques about Openai o1 ☆210 Updated last month
- Repo of paper "Free Process Rewards without Process Labels" ☆123 Updated last month
- ☆20 Updated this week
- Codes and Data for Scaling Relationship on Learning Mathematical Reasoning with Large Language Models ☆245 Updated 5 months ago
- ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - NeurIPS 2023 (oral) ☆256 Updated 10 months ago
- OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning ☆106 Updated last month