PAIR-code/interpretability

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/PAIR-code/interpretability)

PAIR-code / interpretability

PAIR.withgoogle.com and friend's work on interpretability methods

☆234

Alternatives and similar repositories for interpretability

Users that are interested in interpretability are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

saprmarks / feature-circuits
View on GitHub
☆223Oct 14, 2025Updated 9 months ago
ckkissane / crosscoder-model-diff-replication
View on GitHub
Open source replication of Anthropic's Crosscoders for Model Diffing
☆68Oct 27, 2024Updated last year
google / belief-localization
View on GitHub
This repository includes code for the paper "Does Localization Inform Editing? Surprising Differences in Where Knowledge Is Stored vs. Ca…
☆62May 9, 2023Updated 3 years ago
callummcdougall / sae_vis
View on GitHub
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆267Feb 27, 2026Updated 4 months ago
EleutherAI / sparsify
View on GitHub
Sparsify transformers with SAEs and transcoders
☆734Updated this week
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
ruizheliUOA / Awesome-Interpretability-in-Large-Language-Models
View on GitHub
This repository collects all relevant resources about interpretability in LLMs
☆402Nov 1, 2024Updated last year
aviclu / ffn-values
View on GitHub
☆67May 18, 2023Updated 3 years ago
Alrope123 / prompt-waywardness
View on GitHub
☆14Apr 27, 2022Updated 4 years ago
jiahai-feng / binding-iclr
View on GitHub
☆19Mar 5, 2024Updated 2 years ago
AlignmentResearch / tuned-lens
View on GitHub
Tools for understanding how transformer predictions are built layer-by-layer
☆605Aug 7, 2025Updated 11 months ago
koayon / atp_star
View on GitHub
PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind)
☆20Jan 19, 2025Updated last year
TransformerLensOrg / TransformerLens
View on GitHub
A library for mechanistic interpretability of GPT-style language models
☆3,712Updated this week
decoderesearch / SAELens
View on GitHub
Training Sparse Autoencoders on Language Models
☆1,484Updated this week
TransluceAI / circuits
View on GitHub
ADAG: Transluce's MLP neuron-level circuit tracing library
☆34Apr 10, 2026Updated 3 months ago
End-to-end encrypted email - Proton Mail • Ad
Special offer: 40% Off Yearly / 80% Off First Month. All Proton services are open source and independently audited for security.
saprmarks / dictionary_learning
View on GitHub
☆428Aug 21, 2025Updated 11 months ago
andyrdt / refusal_direction
View on GitHub
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆424Jun 13, 2025Updated last year
mega002 / ff-layers
View on GitHub
The accompanying code for "Transformer Feed-Forward Layers Are Key-Value Memories". Mor Geva, Roei Schuster, Jonathan Berant, and Omer Le…
☆103Sep 5, 2021Updated 4 years ago
ndif-team / nnsight
View on GitHub
The nnsight package enables interpreting and manipulating the internals of deep learned models.
☆998Updated this week
ArthurConmy / Automatic-Circuit-Discovery
View on GitHub
☆293Oct 1, 2024Updated last year
yihuaihong / ConceptVectors
View on GitHub
[EMNLP 2025 Main] ConceptVectors Benchmark and Code for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces"
☆40Aug 20, 2025Updated 11 months ago
edenbiran / HoppingTooLate
View on GitHub
Exploring the Limitations of Large Language Models on Multi-Hop Queries
☆33Mar 2, 2025Updated last year
collin-burns / discovering_latent_knowledge
View on GitHub
☆287Mar 2, 2024Updated 2 years ago
KoyenaPal / future-lens
View on GitHub
Code and Data Repo for the CoNLL Paper -- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
☆21Oct 24, 2025Updated 9 months ago
AI Agents on DigitalOcean Gradient AI Platform • Ad
Build production-ready AI agents using customizable tools or access multiple LLMs through a single endpoint. Create custom knowledge bases or connect external data.
MaheepChaudhary / SAE-Ravel
View on GitHub
Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…
☆13Jan 26, 2025Updated last year
technion-cs-nlp / vlm-circuits-analysis
View on GitHub
Code for the experiments and websites of the paper "Same Task, Different Circuits"
☆36Updated this week
Betswish / MIRAGE
View on GitHub
Easy-to-use MIRAGE code for faithful answer attribution in RAG applications. Paper: https://aclanthology.org/2024.emnlp-main.347/
☆25Mar 10, 2025Updated last year
jacobdunefsky / transcoder_circuits
View on GitHub
☆211Nov 17, 2024Updated last year
davidbau / baukit
View on GitHub
☆257Feb 22, 2024Updated 2 years ago
GChrysostomou / ood_faith
View on GitHub
☆13Jul 26, 2023Updated 3 years ago
KihoPark / LLM_Categorical_Hierarchical_Representations
View on GitHub
Code for 'The Geometry of Categorical and Hierarchical Concepts in Large Language Models' (ICLR 2025, Oral)
☆115Feb 11, 2025Updated last year
langtech-bsc / mt-evaluation
View on GitHub
A framework for evaluating Machine Translation models.
☆13Apr 21, 2026Updated 3 months ago
ApolloResearch / deception-detection
View on GitHub
☆44Feb 11, 2025Updated last year
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
adamkarvonen / SAEBench
View on GitHub
☆178May 1, 2026Updated 2 months ago
tilde-research / activault
View on GitHub
Engine for collecting, uploading, and downloading model activations
☆30Apr 2, 2025Updated last year
KihoPark / linear_rep_geometry
View on GitHub
Code for 'The Linear Representation Hypothesis and the Geometry of Large Language Models' (ICML 2024)
☆125Feb 11, 2025Updated last year
IKMLab / arct2
View on GitHub
Code for reproducing experiments in our ACL 2019 paper "Probing Neural Network Comprehension of Natural Language Arguments"
☆54Jun 21, 2022Updated 4 years ago
ai-safety-foundation / sparse_autoencoder
View on GitHub
Sparse Autoencoder for Mechanistic Interpretability
☆303Jul 20, 2024Updated 2 years ago
GraySwanAI / circuit-breakers
View on GitHub
Improving Alignment and Robustness with Circuit Breakers
☆266Sep 24, 2024Updated last year
koayon / phil-interp-papers
View on GitHub
A curated reading list for researchers in the Philosophy of Interpretability
☆17Aug 17, 2025Updated 11 months ago