taisazero / socratic-debugging-benchmark
This repository contains the code and dataset for Socratic Debugging, a novel task in which novice debuggers are guided through Socratic questioning towards discovering and fixing a bug in a Python program.
☆17, updated last year
Alternatives and similar repositories for socratic-debugging-benchmark:
Users who are interested in socratic-debugging-benchmark are comparing it to the repositories listed below:
- NAACL 2024. Code & Dataset for "Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistake… ☆37, updated 9 months ago
- [ACL'24] Code and data of paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator" ☆54, updated last year
- ☆23, updated 11 months ago
- Supporting code for ReCEval paper ☆28, updated 7 months ago
- 🧮 MathDial: A Dialog Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems, EMNLP Findings 2023 ☆52, updated 2 months ago
- PACIFIC: Towards Proactive Conversational Question Answering over Tabular and Textual Data in Finance ☆14, updated 11 months ago
- ☆42, updated 9 months ago
- Public repository for "Think Twice: Perspective-Taking Improves Large Language Models' Theory-of-Mind Capabilities". ☆19, updated last year
- Grade-School Math with Irrelevant Context (GSM-IC) benchmark is an arithmetic reasoning dataset built upon GSM8K, by adding irrelevant se… ☆60, updated 2 years ago
- Code for the paper "REV: Information-Theoretic Evaluation of Free-Text Rationales" ☆15, updated last year
- [NeurIPS 2023] PyTorch code for Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind ☆67, updated last year
- ☆41, updated last year
- Code/data for MARG (multi-agent review generation) ☆43, updated 5 months ago
- Byte-sized text games for code generation tasks on virtual environments ☆19, updated 10 months ago
- [NeurIPS 2024] Train LLMs with diverse system messages reflecting individualized preferences to generalize to unseen system messages ☆45, updated 5 months ago
- Code and Data for "Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering" ☆83, updated 8 months ago
- Code and benchmark for our EMNLP 2023 paper - "FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions" ☆55, updated 11 months ago
- Investigating Cultural Alignment of Large Language Models ☆11, updated 8 months ago
- Generating diverse counterfactual data for Natural Language Understanding tasks using Large Language Models (LLMs). The generator support… ☆36, updated last year
- Implementation of the Paper "Goal-Driven Explainable Clustering via Language Descriptions" ☆36, updated last year
- ☆47, updated 11 months ago
- Data and code for the paper "NormBank: A Knowledge Bank of Situational Social Norms" ☆27, updated last year
- The corresponding code from our paper "REFINER: Reasoning Feedback on Intermediate Representations" (EACL 2024). Do not hesitate t… ☆70, updated last year
- [EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation ☆48, updated last year
- This repository contains data, code and models for contextual noncompliance. ☆22, updated 9 months ago
- Code, datasets, models for the paper "Automatic Evaluation of Attribution by Large Language Models" ☆56, updated last year
- [arXiv preprint] Official Repository for "Evaluating Language Models as Synthetic Data Generators" ☆33, updated 4 months ago
- Code and Dataset for Learning to Solve Complex Tasks by Talking to Agents ☆24, updated 2 years ago
- HANNA, a large annotated dataset of Human-ANnotated NArratives for ASG evaluation. ☆33, updated 6 months ago
- Language Models of Code are Few-Shot Commonsense Learners (EMNLP 2022) ☆86, updated 2 years ago