milesaturpin/cot-unfaithfulness

☆57

Alternatives and similar repositories for cot-unfaithfulness

Users who are interested in cot-unfaithfulness are comparing it to the repositories listed below.

raybears / cot-transparency
Improving transparency of large language models' reasoning
☆15 • Nov 25, 2025 • Updated 8 months ago
aengusl / latent-adversarial-training
☆48 • Sep 29, 2024 • Updated last year
jettjaniak / chainscope
Repository for the "Chain-of-Thought Reasoning In The Wild Is Not Always Faithful" paper
☆35 • Mar 31, 2026 • Updated 3 months ago
aypan17 / reward-misspecification
☆10 • Mar 13, 2023 • Updated 3 years ago
ethz-spylab / superhuman-ai-consistency
☆30 • Jun 19, 2023 • Updated 3 years ago
redwoodresearch / alignment_faking_public
☆95 • Oct 8, 2025 • Updated 9 months ago
montemac / activation_additions
Algebraic value editing in pretrained language models
☆71 • Nov 1, 2023 • Updated 2 years ago
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆74 • Jun 19, 2024 • Updated 2 years ago
TransluceAI / introspective-interp
Repository for "Training Language Models To Explain Their Own Computations"
☆23 • Jul 7, 2026 • Updated 3 weeks ago
ronentk / dbca-splitter
Independent implementation of DBCA method from http://arxiv.org/abs/1912.09713
☆11 • Nov 25, 2020 • Updated 5 years ago
thestephencasper / benchmarking_interpretability
☆35 • Sep 13, 2023 • Updated 2 years ago
apartresearch / DarkBench
Benchmarking Dark Patterns in LLMs (ICLR 2025)
☆18 • Mar 29, 2025 • Updated last year
thejaminator / latteries
James' cookbook of evaluations and finetuning experiments
☆32 • Feb 19, 2026 • Updated 5 months ago
keven980716 / weak-to-strong-deception
[ICLR 2025] Code&Data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization"
☆15 • Jun 21, 2024 • Updated 2 years ago
ucl-dark / llm_debate
Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"
☆131 • Mar 22, 2024 • Updated 2 years ago
daeveraert / gradient-information-optimization
Implementation of Gradient Information Optimization (GIO) for effective and scalable training data selection
☆14 • Jun 22, 2023 • Updated 3 years ago
rmin2000 / adv_tracing
Identification of the Adversary from a Single Adversarial Example (ICML 2023)
☆10 • Jul 15, 2024 • Updated 2 years ago
ekinakyurek / influence
Code for "Tracing Knowledge in Language Models Back to the Training Data"
☆40 • Dec 27, 2022 • Updated 3 years ago
peterbhase / LAS-NL-Explanations
Code for paper "Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language?"
☆21 • Oct 13, 2020 • Updated 5 years ago
F2-Song / ICDPO
The official implementation of "ICDPO: Effectively Borrowing Alignment Capability of Others via In-context Direct Preference Optimization…
☆16 • Feb 15, 2024 • Updated 2 years ago
yoavgur / PISCES
🪝PISCES - Precise In-Parameter Suppression for Concept EraSure in Large Language Models
☆13 • Jun 28, 2026 • Updated last month
shadowkiller33 / Contrast-Instruction
☆19 • Oct 2, 2023 • Updated 2 years ago
MaheepChaudhary / SAE-Ravel
Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…
☆13 • Jan 26, 2025 • Updated last year
Trustworthy-ML-Lab / Describe-and-Dissect
[TMLR 25] An automated method for explaining complex neuron behaviors in deep vision models using large language models
☆11 • Feb 20, 2025 • Updated last year
mlepori1 / NeuroSurgeon
NeuroSurgeon is a package that enables researchers to uncover and manipulate subnetworks within models in Huggingface Transformers
☆43 • Feb 12, 2025 • Updated last year
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆266 • Sep 24, 2024 • Updated last year
Heidelberg-NLP / CC-SHAP-VLM
Official code implementation for the paper "Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Expl…
☆12 • Jul 14, 2026 • Updated 2 weeks ago
jinlanfu / Polyglot_Prompt
Code and dataset for Polyglot Prompting: Multilingual Multitask Prompt Training.
☆18 • Dec 7, 2022 • Updated 3 years ago
avalonstrel / Mitigating-the-Alignment-Tax-of-RLHF
☆16 • Feb 8, 2024 • Updated 2 years ago
pratyushmaini / ssft
[NeurIPS'22] Official Repository for Characterizing Datapoints via Second-Split Forgetting
☆16 • Aug 11, 2023 • Updated 2 years ago
xiye17 / EvalQAExpl
Code for Evaluating Explanations for Reading Comprehension with Realistic Counterfactuals.
☆17 • Apr 25, 2021 • Updated 5 years ago
Egg-Hu / Awesome-Synthetic-Data-Generation
☆20 • Jan 7, 2026 • Updated 6 months ago
HanjieChen / REV
Code for the paper "REV: Information-Theoretic Evaluation of Free-Text Rationales"
☆16 • Aug 11, 2023 • Updated 2 years ago
vedantpalit / Towards-Vision-Language-Mechanistic-Interpretability
This is the official repository for the "Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP" paper acce…
☆25 • Feb 16, 2026 • Updated 5 months ago
MJ-Jang / BECEL
☆10 • Jan 28, 2024 • Updated 2 years ago
visinf / fast-axiomatic-attribution
Fast Axiomatic Attribution for Neural Networks (NeurIPS*2021)
☆15 • Feb 24, 2026 • Updated 5 months ago
InvokerStark / OverKill
☆16 • Jun 13, 2024 • Updated 2 years ago
SophieZheng998 / RSafe
Official implementation for "RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards"
☆17 • Jan 31, 2026 • Updated 5 months ago
EleutherAI / concept-erasure
Erasing concepts from neural representations with provable guarantees
☆258 • Jan 27, 2025 • Updated last year