mitvis / saliency-cards

Saliency Cards are transparency documentation for saliency methods. Learn about new saliency methods or document your own!

☆16

Alternatives and similar repositories for saliency-cards:

Users that are interested in saliency-cards are comparing it to the libraries listed below

mlepori1 / NeuroSurgeon
NeuroSurgeon is a package that enables researchers to uncover and manipulate subnetworks within models in Huggingface Transformers
☆41Updated 2 months ago
steering-vectors / steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
☆95Updated 2 months ago
neelnanda-io / 1L-Sparse-Autoencoder
☆121Updated last year
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆62Updated 2 months ago
wesg52 / sparse-probing-paper
Sparse probing paper full code.
☆55Updated last year
EleutherAI / concept-erasure
Erasing concepts from neural representations with provable guarantees
☆227Updated 2 months ago
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆91Updated last year
mishajw / repeng
Experiments with representation engineering
☆11Updated last year
apartresearch / interpretability-starter
🧠 Starter templates for doing interpretability research
☆70Updated last year
evandez / REMEDI
Inspecting and Editing Knowledge Representations in Language Models
☆115Updated last year
thestephencasper / everything-you-need
we got you bro
☆35Updated 8 months ago
ArthurConmy / Automatic-Circuit-Discovery
☆218Updated 6 months ago
TomFrederik / unseal
Mechanistic Interpretability for Transformer Models
☆50Updated 2 years ago
ApolloResearch / apd
Attribution-based Parameter Decomposition
☆17Updated this week
EleutherAI / elk
Keeping language models honest by directly eliciting knowledge encoded in their activations.
☆199Updated last week
ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆48Updated 5 months ago
wesg52 / universal-neurons
Universal Neurons in GPT2 Language Models
☆27Updated 10 months ago
adamkarvonen / SAEBench
☆85Updated last week
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆241Updated 4 months ago
milesaturpin / cot-unfaithfulness
☆41Updated last year
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆68Updated 10 months ago
microsoft / mechanistic-error-probe
A mechanistic approach for understanding and detecting factual errors of large language models.
☆43Updated 9 months ago
timaeus-research / devinterp
Tools for studying developmental interpretability in neural networks.
☆88Updated 2 months ago
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆169Updated this week
PAIR-code / interpretability
PAIR.withgoogle.com and friend's work on interpretability methods
☆180Updated last week
KihoPark / LLM_Categorical_Hierarchical_Representations
☆90Updated 2 months ago
montemac / activation_additions
Algebraic value editing in pretrained language models
☆63Updated last year
meg-tong / sycophancy-eval
datasets from the paper "Towards Understanding Sycophancy in Language Models"
☆74Updated last year
jbloomAus / SAEDashboard
☆36Updated last month
explanare / ravel
Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
☆43Updated 6 months ago