eth-nlped / mathdial
🧮 MathDial: A Dialog Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems, EMNLP Findings 2023
☆54 · Updated 3 months ago
Alternatives and similar repositories for mathdial
Users interested in mathdial are comparing it to the repositories listed below
- NAACL 2021: Are NLP Models really able to Solve Simple Math Word Problems? · ☆129 · Updated 2 years ago
- ☆42 · Updated last year
- First explanation metric (diagnostic report) for text generation evaluation · ☆62 · Updated 3 months ago
- ☆48 · Updated last year
- NAACL 2024. Code & Dataset for "Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistake…" · ☆39 · Updated 10 months ago
- ☆177 · Updated 2 years ago
- ☆86 · Updated 2 years ago
- Companion code for FanOutQA: Multi-Hop, Multi-Document Question Answering for Large Language Models (ACL 2024) · ☆53 · Updated 3 weeks ago
- Awesome LLM for NLG Evaluation Papers · ☆24 · Updated last year
- Inspecting and Editing Knowledge Representations in Language Models · ☆116 · Updated last year
- Repository for the Bias Benchmark for QA dataset. · ☆116 · Updated last year
- [NAACL 2024 Outstanding Paper] Source code for the NAACL 2024 paper entitled "R-Tuning: Instructing Large Language Models to Say 'I Don't…" · ☆111 · Updated 10 months ago
- Awesome LLM Self-Consistency: a curated list of Self-consistency in Large Language Models · ☆97 · Updated 9 months ago
- ☆82 · Updated 2 years ago
- ☆18 · Updated last year
- ☆42 · Updated 10 months ago
- [Data + code] ExpertQA: Expert-Curated Questions and Attributed Answers · ☆128 · Updated last year
- [NeurIPS 2024] Train LLMs with diverse system messages reflecting individualized preferences to generalize to unseen system messages · ☆47 · Updated 6 months ago
- ☆33 · Updated 2 years ago
- [arXiv preprint] Official Repository for "Evaluating Language Models as Synthetic Data Generators" · ☆33 · Updated 5 months ago
- WikiWhy is a new benchmark for evaluating LLMs' ability to explain cause-effect relationships. It is a QA dataset containing 9000… · ☆47 · Updated last year
- [ICLR'24 Spotlight] "Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts" · ☆68 · Updated last year
- ☆36 · Updated 2 years ago
- This code accompanies the paper DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering. · ☆17 · Updated 2 years ago
- Code and benchmark for our EMNLP 2023 paper - "FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions" · ☆55 · Updated last year
- ☆21 · Updated 3 years ago
- HANNA, a large annotated dataset of Human-ANnotated NArratives for ASG evaluation. · ☆33 · Updated 7 months ago
- [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following · ☆127 · Updated 10 months ago
- Github repository for "FELM: Benchmarking Factuality Evaluation of Large Language Models" (NeurIPS 2023) · ☆59 · Updated last year
- ☆21 · Updated 2 years ago