EleutherAI / concept-erasure

Erasing concepts from neural representations with provable guarantees

☆239

Alternatives and similar repositories for concept-erasure

Users who are interested in concept-erasure are comparing it to the libraries listed below.

EleutherAI / elk
Keeping language models honest by directly eliciting knowledge encoded in their activations.
☆214 Updated this week
justinchiu / openlogprobs
Extract full next-token probabilities via language model APIs
☆248 Updated last year
collin-burns / discovering_latent_knowledge
☆283 Updated last year
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆231 Updated 11 months ago
neelnanda-io / 1L-Sparse-Autoencoder
☆132 Updated 2 years ago
steering-vectors / steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
☆130 Updated 9 months ago
ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆55 Updated 7 months ago
nostalgebraist / transformer-utils
Utilities for the HuggingFace transformers library
☆72 Updated 2 years ago
AlignmentResearch / tuned-lens
Tools for understanding how transformer predictions are built layer-by-layer
☆550 Updated 4 months ago
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆302 Updated 11 months ago
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆231 Updated this week
google-deepmind / mishax
☆144 Updated 3 months ago
anthropics / toy-models-of-superposition
Notebooks accompanying Anthropic's "Toy Models of Superposition" paper
☆130 Updated 3 years ago
KihoPark / LLM_Categorical_Hierarchical_Representations
☆111 Updated 9 months ago
mcleish7 / arithmetic
Code to reproduce "Transformers Can Do Arithmetic with the Right Embeddings", McLeish et al (NeurIPS 2024)
☆195 Updated last year
r-three / git-theta
git extension for {collaborative, communal, continual} model development
☆216 Updated last year
ArthurConmy / Automatic-Circuit-Discovery
☆258 Updated last year
callummcdougall / sae_visualizer
☆29 Updated last year
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆80 Updated 4 months ago
apartresearch / interpretability-starter
🧠 Starter templates for doing interpretability research
☆74 Updated 2 years ago
likenneth / othello_world
Emergent world representations: Exploring a sequence model trained on a synthetic task
☆191 Updated 2 years ago
KoyenaPal / future-lens
Code and Data Repo for the CoNLL Paper -- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
☆20 Updated last month
annahdo / implementing_activation_steering
A collection of different ways to implement accessing and modifying internal model activations for LLMs
☆19 Updated last year
jessicarumbelow / Backwards
☆85 Updated last year
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆28 Updated last year
lukasberglund / reversal_curse
☆297 Updated 2 years ago
redwoodresearch / Easy-Transformer
☆130 Updated last year
ARBORproject / arborproject.github.io
☆83 Updated 9 months ago
jonhue / activeft
PyTorch library for Active Fine-Tuning
☆95 Updated 2 months ago
timaeus-research / devinterp
Tools for studying developmental interpretability in neural networks.
☆117 Updated 5 months ago