lfy79001 / S3Eval

[NAACL 2024] A Synthetic, Scalable and Systematic Evaluation Suite for Large Language Models

☆32

Alternatives and similar repositories for S3Eval

Users who are interested in S3Eval compare it to the repositories listed below.

ChengpengLi1003 / DotaMath
☆30 Updated 6 months ago
ernie-research / Tool-Augmented-Reward-Model
[ICLR'24 spotlight] Tool-Augmented Reward Modeling
☆50 Updated last month
CriticBench / CriticBench
[ACL 2024 Findings] CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
☆27 Updated last year
Yifan-Song793 / GoodBadGreedy
The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
☆30 Updated 11 months ago
qtli / GSM-Plus
GSM-Plus: Data, Code, and Evaluation for Enhancing Robust Mathematical Reasoning in Math Word Problems.
☆62 Updated last year
OpenMOSS / Say-I-Dont-Know
[ICML'2024] Can AI Assistants Know What They Don't Know?
☆81 Updated last year
dqxiu / KAssess
☆14 Updated last year
RUCAIBox / JiuZhang3.0
The code and data for the paper JiuZhang3.0
☆47 Updated last year
wwxu21 / CUT
Source code of "Reasons to Reject? Aligning Language Models with Judgments"
☆58 Updated last year
GAIR-NLP / weak-to-strong-reasoning
☆59 Updated 10 months ago
qiancheng0 / CREATOR
This is the repository for paper "CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models"
☆25 Updated last year
HillZhang1999 / ICD
Code & Data for our Paper "Alleviating Hallucinations of Large Language Models through Induced Hallucinations"
☆66 Updated last year
GAIR-NLP / MetaCritique
Evaluate the Quality of Critique
☆36 Updated last year
vickywu1022 / OntoProbe-PLMs
Repo for outstanding paper@ACL 2023 "Do PLMs Know and Understand Ontological Knowledge?"
☆32 Updated last year
yyDing1 / ScaleQuest
[ACL-25] We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method to unleash the reasoning capability of LLMs.
☆63 Updated 8 months ago
RUCAIBox / BAMBOO
☆35 Updated last year
Di-viner / LLM-Robustness-to-Irrelevant-Information
[COLM'24] "How Easily do Irrelevant Inputs Skew the Responses of Large Language Models?"
☆22 Updated 9 months ago
dvlab-research / Mr-Ben
This is the repo for our paper "Mr-Ben: A Comprehensive Meta-Reasoning Benchmark for Large Language Models"
☆50 Updated 8 months ago
kkk-an / COFFTEA
Code for Findings of EMNLP2023 paper "Coarse-to-Fine Dual Encoders are Better Frame Identification Learners"
☆12 Updated last year
zhaochen0110 / Cotempqa
Code and data for "Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning?" (ACL 2024)
☆32 Updated last year
TobiasLee / VEC
Visual and Embodied Concepts evaluation benchmark
☆21 Updated last year
Zce1112zslx / IKE
☆41 Updated last year
cxcscmu / MATES
Official repository for MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [NeurIPS 2024]
☆71 Updated 8 months ago
lqtrung1998 / mwp_cot_design
☆13 Updated last year
TianHongZXY / CoRe
[ACL 2023] Solving Math Word Problems via Cooperative Reasoning induced Language Models (LLMs + MCTS + Self-Improvement)
☆49 Updated last year
OpenBMB / CPO
☆22 Updated last year
siyuyuan / coscript
Resources for our ACL 2023 paper: Distilling Script Knowledge from Large Language Models for Constrained Language Planning
☆36 Updated last year
NanshineLoong / Self-Evolving-Benchmark
A framework for evolving and testing question-answering datasets with various models.
☆16 Updated last year
GAIR-NLP / ReasonEval
[AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy
☆63 Updated 7 months ago
rookie-joe / AutoPSV
☆46 Updated 8 months ago