redwoodresearch / alignment_faking_public

☆81

Alternatives and similar repositories for alignment_faking_public

Users who are interested in alignment_faking_public are comparing it to the repositories listed below.

ArthurConmy / Automatic-Circuit-Discovery
☆253 Updated last year
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆193 Updated last year
jacobdunefsky / transcoder_circuits
☆188 Updated last year
saprmarks / feature-circuits
☆194 Updated last month
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆225 Updated last week
HoagyC / sparse_coding
Using sparse coding to find distributed representations used by neural networks.
☆283 Updated 2 years ago
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆284 Updated last year
redwoodresearch / Easy-Transformer
☆129 Updated last year
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆119 Updated 2 months ago
nrimsky / LM-exp
LLM experiments done during SERI MATS, focusing on activation steering and interpreting activation spaces
☆99 Updated 2 years ago
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆227 Updated 11 months ago
andyrdt / refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆299 Updated 5 months ago
ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆60 Updated last year
neelnanda-io / Crosscoders
☆56 Updated last year
UKGovernmentBEIS / control-arena
ControlArena is a collection of settings, model organisms and protocols for running control experiments.
☆128 Updated this week
steering-vectors / steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
☆129 Updated 8 months ago
neelnanda-io / 1L-Sparse-Autoencoder
☆132 Updated 2 years ago
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆301 Updated 11 months ago
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆80 Updated 4 months ago
ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆55 Updated 6 months ago
ucl-dark / llm_debate
Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"
☆121 Updated last year
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆242 Updated last year
montemac / activation_additions
Algebraic value editing in pretrained language models
☆66 Updated 2 years ago
adamkarvonen / SAEBench
☆136 Updated this week
saprmarks / geometry-of-truth
☆94 Updated last year
safety-research / safety-tooling
Inference API for many LLMs and other useful tools for empirical research
☆80 Updated this week
aypan17 / machiavelli
☆141 Updated 3 months ago
METR / task-standard
METR Task Standard
☆167 Updated 9 months ago
meg-tong / sycophancy-eval
Datasets from the paper "Towards Understanding Sycophancy in Language Models"
☆96 Updated 2 years ago
science-of-finetuning / crosscoder_learning
Modified to support crosscoder training.
☆24 Updated last month