eth-lre / mathtutorbench
Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors, EMNLP 2025 Oral
☆26 · Updated 3 weeks ago
Alternatives and similar repositories for mathtutorbench
Users interested in mathtutorbench are comparing it to the repositories listed below.
- Multi-turn RL framework for aligning models to be tutors instead of answerers. EMNLP 2025 Oral ☆26 · Updated 3 weeks ago
- 🧮 MathDial: A Dialog Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems, EMNLP Findings 2023 ☆70 · Updated 2 months ago
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them ☆535 · Updated last year
- RewardBench: the first evaluation tool for reward models. ☆667 · Updated 6 months ago
- Sotopia: an Open-ended Social Learning Environment (ICLR 2024 Spotlight) ☆265 · Updated this week
- This repository hosts the paper “LLM Based Math Tutoring: Challenges and Dataset”, along with the accompanying dataset. It explores the p… ☆54 · Updated last year
- ☆341 · Updated 6 months ago
- Official repo for the ICLR 2024 paper “MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback” by Xingyao Wang*, Ziha… ☆134 · Updated last year
- Official implementation of the paper “DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models” ☆528 · Updated 10 months ago
- Code for papers on Large Language Model Personalization (LaMP) ☆178 · Updated 9 months ago
- ☆52 · Updated 9 months ago
- Code for “STaR: Bootstrapping Reasoning With Reasoning” (NeurIPS 2022) ☆218 · Updated 2 years ago
- Code and data for “Lost in the Middle: How Language Models Use Long Contexts” ☆366 · Updated last year
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨ ☆270 · Updated last year
- A collection of research papers on self-correcting large language models with automated feedback. ☆558 · Updated last year
- A package to evaluate the factuality of long-form generation. Original implementation of the EMNLP 2023 paper “FActScore: Fine-grained Atomic… ☆410 · Updated 8 months ago
- An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors ☆23 · Updated 2 months ago
- A new tool-learning benchmark aiming at well-balanced stability and realism, based on ToolBench. ☆200 · Updated 7 months ago
- ☆627 · Updated 4 months ago
- An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral] ☆370 · Updated last year
- Reproducible, flexible LLM evaluations ☆301 · Updated 3 weeks ago
- ☆293 · Updated last year
- Awesome LLM Self-Consistency: a curated list of work on self-consistency in large language models ☆115 · Updated 4 months ago
- LLMs can generate feedback on their own work, use it to improve their output, and repeat this process iteratively. ☆760 · Updated last year
- Generative Judge for Evaluating Alignment ☆248 · Updated last year
- Repository containing the source code for Self-Evaluation Guided MCTS for online DPO. ☆326 · Updated last year
- Repository for HaluEval, a large-scale hallucination evaluation benchmark for Large Language Models. ☆534 · Updated last year
- ☆110 · Updated last year
- GitHub repository for “RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models” ☆215 · Updated last year
- Data and code for “Program of Thoughts” [TMLR 2023] ☆300 · Updated last year