taisazero / socratic-debugging-benchmark
This repository contains the code and dataset for Socratic Debugging, a novel task for Socratically questioning novice debuggers to guide them toward discovering and fixing bugs in a Python program.
☆17 · Updated last year
Alternatives and similar repositories for socratic-debugging-benchmark:
Users interested in socratic-debugging-benchmark are comparing it to the repositories listed below.
- 🧮 MathDial: A Dialog Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems, EMNLP Findings 2023 ☆50 · Updated 3 weeks ago
- Code for the arXiv paper "LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond" ☆58 · Updated 2 months ago
- [ICLR 2023] Code for our paper "Selective Annotation Makes Language Models Better Few-Shot Learners" ☆109 · Updated last year
- Code and data for "Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering" ☆83 · Updated 7 months ago
- ☆41 · Updated 7 months ago
- The Grade-School Math with Irrelevant Context (GSM-IC) benchmark is an arithmetic reasoning dataset built upon GSM8K by adding irrelevant se… ☆58 · Updated 2 years ago
- DialOp: Decision-oriented dialogue environments for collaborative language agents ☆106 · Updated 4 months ago
- ☆93 · Updated 10 months ago
- Language Models of Code are Few-Shot Commonsense Learners (EMNLP 2022) ☆86 · Updated 2 years ago
- ☆86 · Updated last year
- [Data + code] ExpertQA: Expert-Curated Questions and Attributed Answers ☆126 · Updated last year
- Token-level Reference-free Hallucination Detection ☆94 · Updated last year
- Code and data accompanying our arXiv paper "Faithful Chain-of-Thought Reasoning" ☆157 · Updated 10 months ago
- NAACL 2024. Code & dataset for "🌁 Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistake… ☆36 · Updated 8 months ago
- ☆40 · Updated 11 months ago
- Code, datasets, and models for the paper "Automatic Evaluation of Attribution by Large Language Models" ☆56 · Updated last year
- Code for the paper "REV: Information-Theoretic Evaluation of Free-Text Rationales" ☆15 · Updated last year
- A set of utilities for running few-shot prompting experiments on large language models ☆118 · Updated last year
- The corresponding code from our paper "REFINER: Reasoning Feedback on Intermediate Representations" (EACL 2024). Do not hesitate t… ☆70 · Updated last year
- [NeurIPS 2023] PyTorch code for "Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind" ☆67 · Updated last year
- Supporting code for the ReCEval paper ☆28 · Updated 6 months ago
- ☆44 · Updated 7 months ago
- [ACL'24] Code and data for the paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator" ☆54 · Updated last year
- Fact-checking the output of generative large language models in both annotation and evaluation ☆89 · Updated last year
- RARR: Researching and Revising What Language Models Say, Using Language Models ☆46 · Updated last year
- Inspecting and Editing Knowledge Representations in Language Models ☆114 · Updated last year
- 🌲 Code for our EMNLP 2023 paper 🎄 "Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Mode… ☆48 · Updated last year
- [EMNLP 2022 Findings] Code for the paper "ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback" ☆26 · Updated 2 years ago
- Datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆73 · Updated last year
- ☆43 · Updated 9 months ago