taisazero / socratic-debugging-benchmark
This repository contains the code and dataset for Socratic Debugging, a novel task in which novice debuggers are questioned Socratically to guide them toward discovering and fixing the bug in a faulty Python program.
☆18 Updated last year
Alternatives and similar repositories for socratic-debugging-benchmark
Users who are interested in socratic-debugging-benchmark also compare it to the repositories listed below.
- ☆42 Updated last year
- NAACL 2024. Code & Dataset for "Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistake… ☆39 Updated 10 months ago
- [EMNLP'23] Execution-Based Evaluation for Open Domain Code Generation ☆48 Updated last year
- ☆95 Updated last year
- ☆22 Updated last year
- Sotopia-π: Interactive Learning of Socially Intelligent Language Agents (ACL 2024) ☆65 Updated last year
- The LM Contamination Index is a manually created database of contamination evidences for LMs. ☆78 Updated last year
- Evaluating the Moral Beliefs Encoded in LLMs ☆26 Updated 5 months ago
- Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments (EMNLP'2024) ☆36 Updated 5 months ago
- [NeurIPS 2023] PyTorch code for Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind ☆67 Updated last year
- A collection of works that investigate social agents, simulations and their real-world impact in text, embodied, and robotics contexts. ☆89 Updated last year
- Grade-School Math with Irrelevant Context (GSM-IC) benchmark is an arithmetic reasoning dataset built upon GSM8K, by adding irrelevant se… ☆60 Updated 2 years ago
- Code, datasets, models for the paper "Automatic Evaluation of Attribution by Large Language Models" ☆56 Updated last year
- ☆24 Updated 9 months ago
- A Computational Framework for Behavioral Assessment of LLM Therapists ☆27 Updated 7 months ago
- Token-level Reference-free Hallucination Detection ☆94 Updated last year
- 👻 Code and benchmark for our EMNLP 2023 paper - "FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions" ☆55 Updated last year
- Code for "From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Mod… ☆37 Updated last year
- ☆44 Updated 9 months ago
- Code and Dataset for Learning to Solve Complex Tasks by Talking to Agents ☆24 Updated 3 years ago
- [ICLR 2023] Code for our paper "Selective Annotation Makes Language Models Better Few-Shot Learners" ☆109 Updated last year
- This repository contains data, code and models for contextual noncompliance. ☆22 Updated 10 months ago
- Code/data for MARG (multi-agent review generation) ☆43 Updated 6 months ago
- A dataset of LLM-generated chain-of-thought steps annotated with mistake location. ☆81 Updated 9 months ago
- A set of utilities for running few-shot prompting experiments on large-language models ☆121 Updated last year
- ☆26 Updated last week
- Code for the paper "REV: Information-Theoretic Evaluation of Free-Text Rationales" ☆15 Updated last year
- ☆24 Updated last year
- ☆106 Updated last year
- Tasks for describing differences between text distributions. ☆16 Updated 9 months ago