open-compass / T-Eval

[ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step

☆294

Alternatives and similar repositories for T-Eval

Users who are interested in T-Eval are comparing it to the repositories listed below.

InternLM / Agent-FLAN
[ACL2024 Findings] Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models
☆354 Updated last year
QwenLM / AutoIF
☆312 Updated last year
OFA-Sys / InsTag
InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning
☆278 Updated 2 years ago
open-compass / BotChat
Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
☆158 Updated 5 months ago
hkust-nlp / deita
Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
☆572 Updated 10 months ago
tianyi-lab / Cherry_LLM
[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other mo…
☆398 Updated 4 months ago
X-PLUG / Multi-LLM-Agent
☆233 Updated last year
GAIR-NLP / ProX
[ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
☆263 Updated 3 months ago
THUNLP-MT / StableToolBench
A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench.
☆192 Updated 6 months ago
GAIR-NLP / auto-j
Generative Judge for Evaluating Alignment
☆247 Updated last year
OpenBMB / UltraEval
[ACL 2024 Demo] Official GitHub repo for UltraEval: An open source framework for evaluating foundation models.
☆251 Updated last year
THUDM / LongAlign
[EMNLP 2024] LongAlign: A Recipe for Long Context Alignment of LLMs
☆257 Updated 10 months ago
thu-coai / CritiqueLLM
☆147 Updated last year
qiancheng0 / ToolRL
☆367 Updated 2 weeks ago
chenchen0103 / ACEBench
☆129 Updated last week
modelscope / Trinity-RFT
Trinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement fine-tuning (RFT) of large language models (…
☆379 Updated this week
wjn1996 / Awesome-LLM-Reasoning-Openai-o1-Survey
Related work and background techniques for OpenAI o1
☆223 Updated 9 months ago
a-m-team / a-m-models
a-m-team's exploration in large language modeling
☆190 Updated 5 months ago
zjunlp / AutoAct
[ACL 2024] AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning
☆229 Updated 9 months ago
dvlab-research / Step-DPO
Implementation for "Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs"
☆385 Updated 9 months ago
lqtrung1998 / mwp_ReFT
☆548 Updated 9 months ago
thu-coai / BPO
☆330 Updated last year
MARIO-Math-Reasoning / Super_MARIO
☆342 Updated 4 months ago
CASIA-LM / MoDS
☆145 Updated last year
OpenBMB / UltraFeedback
A large-scale, fine-grained, diverse preference dataset (and models).
☆354 Updated last year
anchen1011 / FireAct
FireAct: Toward Language Agent Fine-tuning
☆283 Updated 2 years ago
OFA-Sys / gsm8k-ScRel
Codes and Data for Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
☆266 Updated last year
RUCAIBox / Slow_Thinking_with_LLMs
A series of technical reports on Slow Thinking with LLMs
☆744 Updated 2 months ago
Junjie-Ye / ToolEyes
[COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
☆69 Updated 5 months ago
tianyi-lab / Superfiltering
[ACL'24] Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning
☆180 Updated 4 months ago