eth-lre / mathtutorbench
Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors
☆11 Updated last week
Alternatives and similar repositories for mathtutorbench:
Users who are interested in mathtutorbench are comparing it to the repositories listed below:
- 🧮 MathDial: A Dialog Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems, EMNLP Findings 2023 ☆51 Updated last month
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them ☆483 Updated 10 months ago
- This repository hosts the paper “LLM Based Math Tutoring: Challenges and Dataset”, along with the accompanying dataset. It explores the p… ☆42 Updated 7 months ago
- TruthfulQA: Measuring How Models Imitate Human Falsehoods ☆721 Updated 3 months ago
- ☆106 Updated 11 months ago
- A collection of works that investigate social agents, simulations and their real-world impact in text, embodied, and robotics contexts. ☆85 Updated 10 months ago
- The repository for the survey paper <<Survey on Large Language Models Factuality: Knowledge, Retrieval and Domain-Specificity>> ☆339 Updated last year
- Edu-ConvoKit: An Open-Source Framework for Education Conversation Data ☆92 Updated last week
- ☆71 Updated last year
- NAACL 2024. Code & Dataset for "🌁 Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistake… ☆37 Updated 9 months ago
- Code and data for "Lost in the Middle: How Language Models Use Long Contexts" ☆340 Updated last year
- Synthetic question-answering dataset to formally analyze the chain-of-thought output of large language models on a reasoning task. ☆144 Updated 6 months ago
- Data and Code for Program of Thoughts (TMLR 2023) ☆269 Updated 11 months ago
- Repository for Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions, ACL23 ☆203 Updated 10 months ago
- Codes for papers on Large Language Models Personalization (LaMP) ☆157 Updated 2 months ago
- This repository contains the data and code introduced in the paper "CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Maske… ☆117 Updated last year
- What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets ☆218 Updated 5 months ago
- This is the repository of HaluEval, a large-scale hallucination evaluation benchmark for Large Language Models. ☆463 Updated last year
- Github repository for "RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models" ☆170 Updated 4 months ago
- [EMNLP 2023] Enabling Large Language Models to Generate Text with Citations. Paper: https://arxiv.org/abs/2305.14627 ☆480 Updated 6 months ago
- ☆287 Updated last year
- RewardBench: the first evaluation tool for reward models. ☆555 Updated 2 months ago
- Generative Judge for Evaluating Alignment ☆236 Updated last year
- Code for STaR: Bootstrapping Reasoning With Reasoning (NeurIPS 2022) ☆203 Updated 2 years ago
- Source code of our paper MIND, ACL 2024 Long Paper ☆39 Updated 10 months ago
- Forward-Looking Active REtrieval-augmented generation (FLARE) ☆625 Updated last year
- MAD: The first work to explore Multi-Agent Debate with Large Language Models :D ☆366 Updated 3 months ago
- ☆35 Updated 6 months ago
- ☆533 Updated 3 weeks ago
- A package to evaluate factuality of long-form generation. Original implementation of our EMNLP 2023 paper "FActScore: Fine-grained Atomic… ☆340 Updated 2 weeks ago