psunlpgroup / ReaLMistakeLinks

This repository includes a benchmark and code for the paper "Evaluating LLMs at Detecting Errors in LLM Responses".

☆30

Alternatives and similar repositories for ReaLMistake

Users that are interested in ReaLMistake are comparing it to the libraries listed below

Sorting:

OSU-NLP-Group / llm-planning-eval
[ACL'24] Code and data of paper "When is Tree Search Useful for LLM Planning? It Depends on the Discriminator"
☆54Updated last year
XiangLi1999 / AutoBencher
☆29Updated last year
GAIR-NLP / MetaCritique
Evaluate the Quality of Critique
☆36Updated last year
csitfun / LogiCoT
the instructions and demonstrations for building a formal logical reasoning capable GLM
☆53Updated 11 months ago
GAIR-NLP / scaleeval
Scalable Meta-Evaluation of LLMs as Evaluators
☆42Updated last year
UKPLab / acl2025-diverse-cot
Code for the 2025 ACL publication "Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs"
☆29Updated last month
LuLuLuyi / LongHeads
[EMNLP'24] LongHeads: Multi-Head Attention is Secretly a Long Context Processor
☆29Updated last year
epfl-dlab / llm-latent-language
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
☆78Updated last year
DAMO-NLP-SG / contrastive-cot
Contrastive Chain-of-Thought Prompting
☆67Updated last year
google-research-datasets / GSM-IC
Grade-School Math with Irrelevant Context (GSM-IC) benchmark is an arithmetic reasoning dataset built upon GSM8K, by adding irrelevant se…
☆60Updated 2 years ago
Zayne-sprague / MuSR
☆49Updated 11 months ago
WHGTyen / BIG-Bench-Mistake
A dataset of LLM-generated chain-of-thought steps annotated with mistake location.
☆81Updated 11 months ago
Nanami18 / Snowballed_Hallucination
☆45Updated 11 months ago
QingruZhang / PASTA
PASTA: Post-hoc Attention Steering for LLMs
☆122Updated 8 months ago
wwxu21 / CUT
Source code of "Reasons to Reject? Aligning Language Models with Judgments"
☆58Updated last year
GAIR-NLP / benbench
Benchmarking Benchmark Leakage in Large Language Models
☆55Updated last year
Zce1112zslx / IKE
☆41Updated last year
tml-epfl / icl-alignment
Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025]
☆31Updated 6 months ago
abhika-m / FAVA
☆73Updated last year
yuzhaouoe / pretraining-data-packing
[ACL'24 Oral] Analysing The Impact of Sequence Composition on Language Model Pre-Training
☆22Updated 11 months ago
allenai / noncompliance
This repository contains data, code and models for contextual noncompliance.
☆23Updated last year
RUCAIBox / BAMBOO
☆35Updated last year
ernie-research / Tool-Augmented-Reward-Model
[ICLR'24 spotlight] Tool-Augmented Reward Modeling
☆51Updated 2 months ago
nkandpa2 / long_tail_knowledge
Repo for the paper "Large Language Models Struggle to Learn Long-Tail Knowledge"
☆77Updated 2 years ago
eladsegal / strategyqa
The official code of TACL 2021, "Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies".
☆76Updated 2 years ago
princeton-nlp / LM-Science-Tutor
☆43Updated last year
Re-Align / just-eval
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
☆85Updated last year
Strong-AI-Lab / Logical-and-abstract-reasoning
Evaluation on Logical Reasoning and Abstract Reasoning Challenges
☆28Updated 3 months ago
oriyor / reasoning-on-cots
Implementation of the paper: "Answering Questions by Meta-Reasoning over Multiple Chains of Thought"
☆96Updated last year
archiki / ReCEval
Supporting code for ReCEval paper
☆29Updated 10 months ago