eth-lre / mathtutorbench
Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors, EMNLP 2025
☆22 Updated last month
Alternatives and similar repositories for mathtutorbench
Users who are interested in mathtutorbench are comparing it to the repositories listed below
- This repository hosts the paper “LLM Based Math Tutoring: Challenges and Dataset”, along with the accompanying dataset. It explores the p… ☆54 Updated last year
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them ☆521 Updated last year
- Multi-turn RL framework for aligning models to be tutors instead of answerers. EMNLP 2025. ☆23 Updated 3 months ago
- 🧮 MathDial: A Dialog Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems, EMNLP Findings 2023 ☆67 Updated last month
- RewardBench: the first evaluation tool for reward models. ☆646 Updated 4 months ago
- This is the repository of HaluEval, a large-scale hallucination evaluation benchmark for Large Language Models. ☆517 Updated last year
- Codes for papers on Large Language Models Personalization (LaMP) ☆175 Updated 8 months ago
- Repository for Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions, ACL23 ☆237 Updated last year
- Code and data for "Lost in the Middle: How Language Models Use Long Contexts" ☆361 Updated last year
- Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models" ☆521 Updated 9 months ago
- ☆106 Updated last year
- This is a collection of research papers for Self-Correcting Large Language Models with Automated Feedback. ☆556 Updated last year
- Reproducible, flexible LLM evaluations ☆260 Updated this week
- A curated list of Human Preference Datasets for LLM fine-tuning, RLHF, and eval. ☆381 Updated 2 years ago
- Data and Code for Program of Thoughts [TMLR 2023] ☆292 Updated last year
- Prod Env ☆433 Updated 2 years ago
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model ☆553 Updated 9 months ago
- This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks. ☆548 Updated last year
- Github repository for "RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models" ☆205 Updated 11 months ago
- A package to evaluate factuality of long-form generation. Original implementation of our EMNLP 2023 paper "FActScore: Fine-grained Atomic… ☆397 Updated 6 months ago
- ☆52 Updated 7 months ago
- ☆293 Updated last year
- LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively. ☆753 Updated last year
- Kim, J., Evans, J., & Schein, A. (2025). Linear Representations of Political Perspective Emerge in Large Language Models. ICLR. ☆21 Updated 7 months ago
- A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs). ☆892 Updated last month
- Code for STaR: Bootstrapping Reasoning With Reasoning (NeurIPS 2022) ☆214 Updated 2 years ago
- Awesome LLM Self-Consistency: a curated list of Self-consistency in Large Language Models ☆110 Updated 3 months ago
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨ ☆259 Updated last year
- NAACL 2024. Code & Dataset for "🌁 Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistake… ☆44 Updated last year
- [AAAI 2025] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems ☆12 Updated 5 months ago