PAIR-code / interpretability
PAIR.withgoogle.com and friends' work on interpretability methods
☆166 Updated this week
Alternatives and similar repositories for interpretability:
Users who are interested in interpretability are comparing it to the libraries listed below:
- ☆116 Updated last year
- Utilities for the Hugging Face Transformers library ☆64 Updated 2 years ago
- This repository accompanies our paper “Do Prompt-Based Models Really Understand the Meaning of Their Prompts?” ☆85 Updated 2 years ago
- Code to reproduce data for Bias in Bios ☆43 Updated last year
- The accompanying code for "Transformer Feed-Forward Layers Are Key-Value Memories". Mor Geva, Roei Schuster, Jonathan Berant, and Omer Le… ☆89 Updated 3 years ago
- A library for efficient patching and automatic circuit discovery. ☆53 Updated 2 months ago
- A library for finding knowledge neurons in pretrained transformer models. ☆153 Updated 3 years ago
- A Python library that encapsulates various methods for neuron interpretation and analysis in Deep NLP models. ☆100 Updated last year
- Mechanistic Interpretability Visualizations using React ☆227 Updated last month
- The official code of LM-Debugger, an interactive tool for inspection and intervention in transformer-based language models. ☆178 Updated 2 years ago
- Inspecting and Editing Knowledge Representations in Language Models ☆112 Updated last year
- Fairness toolkit for PyTorch, scikit-learn, and AutoGluon ☆31 Updated 2 months ago
- Materials for EACL2024 tutorial: Transformer-specific Interpretability ☆44 Updated 10 months ago
- ☆81 Updated this week
- ☆96 Updated 2 years ago
- ☆203 Updated 4 months ago
- A fast, effective data attribution method for neural networks in PyTorch ☆192 Updated 2 months ago
- Erasing concepts from neural representations with provable guarantees ☆222 Updated 2 weeks ago
- ☆189 Updated 11 months ago
- This repository contains the code for "Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP". ☆87 Updated 3 years ago
- Steering Llama 2 with Contrastive Activation Addition ☆122 Updated 8 months ago
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆182 Updated 2 months ago
- ☆149 Updated this week
- Mechanistic Interpretability for Transformer Models ☆49 Updated 2 years ago
- ☆76 Updated 6 months ago
- ☆53 Updated last year
- ☆87 Updated 2 years ago
- ☆61 Updated last year
- ☆109 Updated 6 months ago
- Steering vectors for transformer language models in PyTorch / Hugging Face ☆87 Updated 2 months ago