THUNLP-MT / StableToolBench
A tool-learning benchmark, based on ToolBench, that aims to balance stability and realism.
☆101 · Updated this week
Related projects:
- ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models ☆148 · Updated 6 months ago
- Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha… ☆100 · Updated 3 months ago
- [ICML 2024] Selecting High-Quality Data for Training Language Models ☆134 · Updated 2 months ago
- Implementation of ICML 23 Paper: Specializing Smaller Language Models towards Multi-Step Reasoning. ☆119 · Updated last year
- InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning ☆196 · Updated last year
- Codes and Data for Scaling Relationship on Learning Mathematical Reasoning with Large Language Models ☆208 · Updated last week
- [ACL 2024] Long-Context Language Modeling with Parallel Encodings ☆133 · Updated 3 months ago
- ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios ☆63 · Updated 5 months ago
- ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - NeurIPS 2023 (oral) ☆230 · Updated 5 months ago
- [ACL2024] Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios ☆36 · Updated 5 months ago
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨ ☆72 · Updated 4 months ago
- 🐋 An unofficial implementation of Self-Alignment with Instruction Backtranslation. ☆128 · Updated 2 months ago
- ☆185 · Updated last month
- [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following ☆104 · Updated 2 months ago
- [ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning ☆337 · Updated 2 months ago
- Project for the paper entitled `Instruction Tuning for Large Language Models: A Survey` ☆134 · Updated 6 months ago
- [arxiv:2406.17419] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA ☆62 · Updated last month
- [ACL 2024] AUTOACT: Automatic Agent Learning from Scratch for QA via Self-Planning ☆162 · Updated 5 months ago
- [ACL'24] Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning ☆101 · Updated last week
- Repository for the paper "Cognitive Mirage: A Review of Hallucinations in Large Language Models" ☆47 · Updated 10 months ago
- Codes for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718 ☆244 · Updated last week
- Generative Judge for Evaluating Alignment ☆208 · Updated 8 months ago
- A large-scale, fine-grained, diverse preference dataset (and models). ☆299 · Updated 8 months ago
- ☆71 · Updated 8 months ago
- Code and Data for "Long-context LLMs Struggle with Long In-context Learning" ☆87 · Updated 2 months ago
- Achieving Efficient Alignment through Learned Correction ☆103 · Updated 3 months ago
- Repository for Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions, ACL23 ☆151 · Updated 3 months ago
- [ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues ☆38 · Updated last month
- Code and data for the paper "Can Large Language Models Understand Real-World Complex Instructions?" (AAAI2024) ☆42 · Updated 5 months ago
- Code for "FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models (ACL 2024)" ☆80 · Updated last week