safety-research / open-source-alignment-faking

Open Source Replication of Anthropic's Alignment Faking Paper

☆13

Alternatives and similar repositories for open-source-alignment-faking

Users who are interested in open-source-alignment-faking compare it to the repositories listed below.

EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆27 Updated last year
milesaturpin / cot-unfaithfulness
☆44 Updated last year
jiahai-feng / binding-iclr
☆14 Updated last year
tml-epfl / icl-alignment
Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025]
☆30 Updated 5 months ago
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆95 Updated 3 weeks ago
MaheepChaudhary / SAE-Ravel
Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…
☆11 Updated 5 months ago
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆70 Updated last year
ejones313 / auditing-llms
☆54 Updated 2 years ago
hijohnnylin / automated-interpretability
☆10 Updated last month
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆67 Updated 2 months ago
KoyenaPal / future-lens
Code and Data Repo for the CoNLL Paper -- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
☆18 Updated last year
koayon / atp_star
PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind)
☆18 Updated 5 months ago
formll / resolving-scaling-law-discrepancies
☆20 Updated 11 months ago
Jiaxin-Wen / MisleadLM
Official code for our paper "Language Models Learn to Mislead Humans via RLHF"
☆14 Updated 8 months ago
saprmarks / geometry-of-truth
☆85 Updated 10 months ago
mcleish7 / gemstone-scaling-laws
☆26 Updated 4 months ago
JoshEngels / MultiDimensionalFeatures
Code for reproducing our paper "Not All Language Model Features Are Linear"
☆75 Updated 7 months ago
IBM / sae-steering
Code to enable layer-level steering in LLMs using sparse auto encoders
☆19 Updated last month
montemac / activation_additions
Algebraic value editing in pretrained language models
☆65 Updated last year
scalable-model-editing / unified-model-editing
We introduce EMMET and unify model editing with popular algorithms ROME and MEMIT.
☆22 Updated 6 months ago
slavachalnev / SAE-TS
Improving Steering Vectors by Targeting Sparse Autoencoder Features
☆21 Updated 7 months ago
Nix07 / finetuning
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity…
☆27 Updated last year
MadryLab / DsDm
☆48 Updated last year
felixbinder / introspection_self_prediction
Code for experiments on self-prediction as a way to measure introspection in LLMs
☆14 Updated 6 months ago
ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆52 Updated last month
yihuaihong / ConceptVectors
ConceptVectors Benchmark and Code for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces"
☆35 Updated 4 months ago
nrimsky / InfluenceFunctions
Implementation of Influence Function approximations for differently sized ML models, using PyTorch
☆15 Updated last year
r-three / RAD
Reference implementation for Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
☆43 Updated last year
ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆55 Updated 8 months ago
lingo-mit / lm-truthfulness
☆17 Updated last year