bilal-chughtai/rep-theory-mech-interp

☆31

Alternatives and similar repositories for rep-theory-mech-interp

Users who are interested in rep-theory-mech-interp are comparing it to the repositories listed below.

hannamw / eap-ig-faithfulness
Code for "Automatic Circuit Finding and Faithfulness"
☆19 · Jul 11, 2024 · Updated 2 years ago
Nix07 / belief_tracking
This repository contains the code used for the experiments in the paper "Language Models use Lookbacks to Track Beliefs".
☆16 · Mar 14, 2026 · Updated 4 months ago
alan-cooney / transformer-lens-starter-template
A quick way to get started with Transformer Lens
☆14 · Dec 13, 2023 · Updated 2 years ago
koayon / phil-interp-papers
A curated reading list for researchers in the Philosophy of Interpretability
☆17 · Aug 17, 2025 · Updated 11 months ago
loftusa / owls
Subliminal learning in LLMs: language models can transmit hidden preferences through seemingly unrelated training data.
☆25 · Nov 9, 2025 · Updated 8 months ago
anthropics / toy-models-of-superposition
Notebooks accompanying Anthropic's "Toy Models of Superposition" paper
☆157 · Sep 14, 2022 · Updated 3 years ago
r-three / realistic_evaluation_of_model_merging_for_compositional_generalization
☆13 · Feb 11, 2026 · Updated 5 months ago
vedantpalit / Towards-Vision-Language-Mechanistic-Interpretability
This is the official repository for the "Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP" paper acce…
☆25 · Feb 16, 2026 · Updated 5 months ago
JasonGross / guarantees-based-mechanistic-interpretability
☆18 · Jul 21, 2026 · Updated last week
safety-research / inoculation-prompting
☆15 · Oct 13, 2025 · Updated 9 months ago
EleutherAI / deep-ignorance
☆20 · Jan 7, 2026 · Updated 6 months ago
JonathanCrabbe / Label-Free-XAI
This repository contains the implementation of Label-Free XAI, a new framework to adapt explanation methods to unsupervised models. For m…
☆25 · Sep 21, 2022 · Updated 3 years ago
ArthurConmy / Automatic-Circuit-Discovery
☆293 · Oct 1, 2024 · Updated last year
TomFrederik / unseal
Mechanistic Interpretability for Transformer Models
☆53 · Jun 1, 2022 · Updated 4 years ago
anthropics / DecompositionFaithfulnessPaper
☆33 · Jul 17, 2023 · Updated 3 years ago
syvb / wikidata
Rust library for working with data from Wikidata.
☆15 · Jul 10, 2025 · Updated last year
r-three / AttriBoT
Code for AttriBoT from "AttriBoT: A Bag of Tricks for Efficiently Approximating Leave-One-Out Context Attribution"
☆15 · Apr 21, 2025 · Updated last year
xuanlinli17 / autoregressive_inference
Code for "Discovering Non-monotonic Autoregressive Orderings with Variational Inference" (paper and code updated from ICLR 2021)
☆12 · Mar 7, 2024 · Updated 2 years ago
LRudL / evalugator
(Model-written) LLM evals library
☆19 · Dec 13, 2024 · Updated last year
UlisseMini / activation_additions_hf
Reimplementation of https://github.com/montemac/algebraic_value_editing in pure PyTorch for efficiency on large models
☆11 · Jun 28, 2023 · Updated 3 years ago
dtch1997 / steering-bench
Official codebase for "Analyzing the Generalization and Reliability of Steering Vectors"
☆22 · Dec 14, 2024 · Updated last year
g-luo / vlm_cross_modal_reps
Official PyTorch Implementation for Vision-Language Models Create Cross-Modal Task Representations, ICML 2025
☆34 · May 1, 2025 · Updated last year
Silent-Zebra / twisted-smc-lm
☆35 · Mar 27, 2025 · Updated last year
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆105 · Sep 21, 2023 · Updated 2 years ago
naganandy / G-MPNN-R
☆14 · Jan 10, 2021 · Updated 5 years ago
redwoodresearch / rust_circuit_public
☆67 · Feb 16, 2023 · Updated 3 years ago
apartresearch / mechanisticinterpretability
A repository for awesome resources in mechanistic interpretability
☆16 · Jan 18, 2023 · Updated 3 years ago
LRudL / sad
Situational Awareness Dataset
☆52 · Dec 14, 2024 · Updated last year
annahdo / implementing_activation_steering
A collection of different ways to implement accessing and modifying internal model activations for LLMs
☆24 · Oct 18, 2024 · Updated last year
ethz-spylab / unlearning-vs-safety
☆27 · Oct 6, 2024 · Updated last year
princeton-nlp / Edge-Pruning
[NeurIPS 2024 Spotlight] Code and data for the paper "Finding Transformer Circuits with Edge Pruning".
☆70 · Aug 15, 2025 · Updated 11 months ago
socketteer / worldspider
gpt completions in vscode
☆35 · Mar 24, 2023 · Updated 3 years ago
wesg52 / sparse-probing-paper
Sparse probing paper full code.
☆68 · Dec 17, 2023 · Updated 2 years ago
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆99 · Dec 31, 2025 · Updated 6 months ago
dbigham / ARC
Abstraction and Reasoning Corpus
☆15 · Nov 22, 2022 · Updated 3 years ago
timaeus-research / devinterp
Tools for studying developmental interpretability in neural networks.
☆146 · Apr 23, 2026 · Updated 3 months ago
real-absolute-AI / Unnatural_Language
The official repository of 'Unnatural Language Are Not Bugs but Features for LLMs'
☆24 · May 20, 2025 · Updated last year
Trustworthy-ML-Lab / Linear-Explanations
[ICML 24] A novel automated neuron explanation framework that can accurately describe poly-semantic concepts in deep neural networks
☆14 · May 2, 2025 · Updated last year
aflah02 / TokenSmith
A comprehensive toolkit for streamlining data editing, search, and inspection for large-scale language model training and interpretabilit…
☆21 · Oct 30, 2025 · Updated 8 months ago