ckkissane / sae-transfer

Code to reproduce key results accompanying "SAEs (usually) Transfer Between Base and Chat Models"

☆12

Alternatives and similar repositories for sae-transfer

Users who are interested in sae-transfer are comparing it to the libraries listed below.

UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆78 Updated 3 months ago
ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆55 Updated 6 months ago
callummcdougall / sae-exercises-mats
☆23 Updated last year
KoyenaPal / future-lens
Code and Data Repo for the CoNLL Paper -- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
☆20 Updated last week
montemac / activation_additions
Algebraic value editing in pretrained language models
☆66 Updated 2 years ago
adamkarvonen / SAE_BoardGameEval
☆23 Updated 9 months ago
explanare / ravel
Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
☆56 Updated last year
MaheepChaudhary / SAE-Ravel
Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…
☆12 Updated 9 months ago
ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆59 Updated last year
hijohnnylin / automated-interpretability
☆15 Updated 3 weeks ago
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆28 Updated last year
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆71 Updated last year
koayon / atp_star
PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind)
☆20 Updated 9 months ago
saprmarks / geometry-of-truth
☆92 Updated last year
Nix07 / finetuning
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity…
☆28 Updated this week
GSYfate / knnlm-limits
Official code repo for paper "Great Memory, Shallow Reasoning: Limits of kNN-LMs"
☆24 Updated 6 months ago
jiahai-feng / binding-iclr
☆16 Updated last year
Aaquib111 / edge-attribution-patching
Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery"
☆42 Updated last year
tilde-research / activault
Engine for collecting, uploading, and downloading model activations
☆24 Updated 7 months ago
tml-epfl / icl-alignment
Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025]
☆31 Updated 9 months ago
mcleish7 / gemstone-scaling-laws
Gemstones: A Model Suite for Multi-Faceted Scaling Laws (NeurIPS 2025)
☆29 Updated last month
angie-chen55 / pref-learning-ranking-acc
☆13 Updated last year
msakarvadia / AttentionLens
Interpreting the latent space representations of attention head outputs for LLMs
☆34 Updated last year
Butanium / tiny-activation-dashboard
A tiny easily hackable implementation of a feature dashboard.
☆15 Updated last week
JoshEngels / MultiDimensionalFeatures
Code for reproducing our paper "Not All Language Model Features Are Linear"
☆83 Updated 11 months ago
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆136 Updated 4 months ago
google / belief-localization
This repository includes code for the paper "Does Localization Inform Editing? Surprising Differences in Where Knowledge Is Stored vs. Ca…
☆61 Updated 2 years ago
formll / resolving-scaling-law-discrepancies
☆20 Updated last year
wesg52 / universal-neurons
Universal Neurons in GPT2 Language Models
☆30 Updated last year
amack315 / unsupervised-steering-vectors
☆36 Updated last year