xingyaoww / mint-benchView external linksLinks
Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Zihan Wang*, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng and Heng Ji.
☆133Jun 4, 2024Updated last year
Alternatives and similar repositories for mint-bench
Users that are interested in mint-bench are comparing it to the libraries listed below
Sorting:
- MetricEval: A framework that conceptualizes and operationalizes four main components of metric evaluation, in terms of reliability and va…☆12Nov 6, 2023Updated 2 years ago
- ToolQA, a new dataset to evaluate the capabilities of LLMs in answering challenging questions with external tools. It offers two levels …☆285Aug 19, 2023Updated 2 years ago
- [COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios☆73May 13, 2025Updated 9 months ago
- ☆11Jan 3, 2024Updated 2 years ago
- Code for EMNLP'24 paper - On Diversified Preferences of Large Language Model Alignment☆16Aug 6, 2024Updated last year
- A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench.☆214Apr 15, 2025Updated 9 months ago
- Code for ACL 2023 paper "A Close Look into the Calibration of Pre-trained Language Models"☆11May 9, 2023Updated 2 years ago
- Code and dataset for the emnlp paper titled Instruct and Extract: Instruction Tuning for On-Demand Information Extraction☆54Jan 2, 2024Updated 2 years ago
- 🌟 SwarmAgent: A framework for simulating social group dynamics using multi-agent collaboration, aiding insights into collective behavior…☆12Dec 5, 2023Updated 2 years ago
- Repository containing the SPIN experiments on the DIBT 10k ranked prompts☆23Mar 12, 2024Updated last year
- Code for ICLR 2024 paper "CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets"☆60Jun 3, 2024Updated last year
- [ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step☆304Apr 3, 2024Updated last year
- Public code repo for EMNLP 2024 Findings paper "MACAROON: Training Vision-Language Models To Be Your Engaged Partners"☆14Sep 28, 2024Updated last year
- 🐳 PyLoader: An asynchronous Python dataloader for loading big datasets, supporting PyTorch and TensorFlow 2.x.☆11Aug 29, 2021Updated 4 years ago
- ☆14Oct 11, 2023Updated 2 years ago
- Syntax Error-Free and Generalizable Tool Use for LLMs via Finite-State Decoding☆28Jan 28, 2024Updated 2 years ago
- Official Code Repository for [AutoScale📈: Scale-Aware Data Mixing for Pre-Training LLMs] Published as a conference paper at **COLM 2025*…☆13Aug 8, 2025Updated 6 months ago
- ☆49Aug 6, 2024Updated last year
- A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)☆3,151Nov 17, 2025Updated 2 months ago
- ☆18Jun 24, 2025Updated 7 months ago
- Code and data for "MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models"☆51Nov 18, 2025Updated 2 months ago
- Code for "Seeking Neural Nuggets: Knowledge Transfer in Large Language Models from a Parametric Perspective"☆33May 9, 2024Updated last year
- Heuristic filtering framework for RefineCode☆82Mar 13, 2025Updated 11 months ago
- [ACL2024 Findings]DMoERM: Recipes of Mixture-of-Experts for Effective Reward Modeling☆18Jun 6, 2024Updated last year
- A web application for playing 20 Questions to crowdsource common sense. 🤖☆16Sep 29, 2022Updated 3 years ago
- ☆22Jan 25, 2023Updated 3 years ago
- ☆19Feb 3, 2022Updated 4 years ago
- Q-Probe: A Lightweight Approach to Reward Maximization for Language Models☆40Jun 10, 2024Updated last year
- [EMNLP 2022] Code and data for "Controllable Dialogue Simulation with In-Context Learning"☆34Feb 22, 2023Updated 2 years ago
- The official repository of "Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint"☆39Jan 12, 2024Updated 2 years ago
- ALFWorld: Aligning Text and Embodied Environments for Interactive Learning☆640Updated this week
- [NeurIPS 2023 D&B Track] Code and data for paper "Revisiting Out-of-distribution Robustness in NLP: Benchmarks, Analysis, and LLMs Evalua…☆36Jun 8, 2023Updated 2 years ago
- Source codes and datasets for How well do Large Language Models perform in Arithmetic tasks?☆57Apr 17, 2023Updated 2 years ago
- Code for paper "Beyond Natural Language: LLMs Leveraging Alternative Formats for Enhanced Reasoning and Communication"☆22Mar 30, 2024Updated last year
- A Domain-Specific Language, Jailbreak Attack Synthesizer and Dynamic LLM Redteaming Toolkit☆27Dec 5, 2024Updated last year
- Code and data for "Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs"☆473Mar 19, 2024Updated last year
- Repo for paper "Tell Me More! Towards Implicit User Intention Understanding of Language Model Driven Agents"☆61Feb 20, 2024Updated last year
- Code for Paper (ReMax: A Simple, Efficient and Effective Reinforcement Learning Method for Aligning Large Language Models)☆199Dec 16, 2023Updated 2 years ago
- Detect-Then-Explain Framework for Text-to-SQL task☆10Dec 6, 2023Updated 2 years ago