eth-nlped / mathdial
🧮 MathDial: A Dialog Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems, EMNLP Findings 2023
☆45 Updated 10 months ago
Alternatives and similar repositories for mathdial:
Users who are interested in mathdial are comparing it to the repositories listed below
- Companion code for FanOutQA: Multi-Hop, Multi-Document Question Answering for Large Language Models (ACL 2024) ☆45 Updated 3 months ago
- [Data + code] ExpertQA: Expert-Curated Questions and Attributed Answers ☆124 Updated 10 months ago
- ☆23 Updated last year
- [NeurIPS 2024] Train LLMs with diverse system messages reflecting individualized preferences to generalize to unseen system messages ☆41 Updated last month
- ☆81 Updated last year
- Code and Data for "Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering" ☆82 Updated 5 months ago
- First explanation metric (diagnostic report) for text generation evaluation ☆62 Updated 6 months ago
- NAACL 2021: Are NLP Models really able to Solve Simple Math Word Problems? ☆120 Updated 2 years ago
- ☆20 Updated last year
- ☆30 Updated last year
- [ACL 2024] LangBridge: Multilingual Reasoning Without Multilingual Supervision ☆83 Updated 2 months ago
- 👻 Code and benchmark for our EMNLP 2023 paper - "FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions" ☆52 Updated 7 months ago
- The official code of TACL 2021, "Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies". ☆65 Updated 2 years ago
- This repository contains the dataset and code for "WiCE: Real-World Entailment for Claims in Wikipedia" in EMNLP 2023. ☆40 Updated last year
- HANNA, a large annotated dataset of Human-ANnotated NArratives for ASG evaluation. ☆31 Updated 3 months ago
- [ICLR 2023] Guess the Instruction! Flipped Learning Makes Language Models Stronger Zero-Shot Learners ☆113 Updated 4 months ago
- Benchmarking Generalization to New Tasks from Natural Language Instructions ☆26 Updated 3 years ago
- Github repository for "FELM: Benchmarking Factuality Evaluation of Large Language Models" (NeurIPS 2023) ☆57 Updated last year
- Inspecting and Editing Knowledge Representations in Language Models ☆111 Updated last year
- ☆70 Updated 11 months ago
- InstructIR, a novel benchmark specifically designed to evaluate the instruction following ability in information retrieval models. Our foc… ☆31 Updated 7 months ago
- Code for the paper InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning ☆99 Updated last year
- Code and data associated with the AmbiEnt dataset in "We're Afraid Language Models Aren't Modeling Ambiguity" (Liu et al., 2023) ☆57 Updated 11 months ago
- Source codes and datasets for How well do Large Language Models perform in Arithmetic tasks? ☆56 Updated last year
- Codebase, data and models for the SummaC paper in TACL ☆87 Updated 2 weeks ago
- Multilingual Large Language Models Evaluation Benchmark ☆115 Updated 4 months ago
- Detect hallucinated tokens for conditional sequence generation. ☆64 Updated 2 years ago
- ☆85 Updated last year
- ☆50 Updated last year
- ☆45 Updated last year