frankaging / Interchange-Intervention-Training
The codebase for "Inducing Causal Structure for Interpretable Neural Networks"
☆10 Updated 3 years ago
Alternatives and similar repositories for Interchange-Intervention-Training
Users who are interested in Interchange-Intervention-Training are comparing it to the repositories listed below.
- Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p… ☆11 Updated 5 months ago
- ☆36 Updated 2 years ago
- CEBaB: Estimating the Causal Effects of Real-World Concepts on NLP Model Behavior ☆12 Updated 2 years ago
- Evaluate interpretability methods on localizing and disentangling concepts in LLMs. ☆47 Updated 8 months ago
- Code for preprint: Summarizing Differences between Text Distributions with Natural Language ☆42 Updated 2 years ago
- ☆44 Updated last year
- ☆9 Updated last year
- A Kernel-Based View of Language Model Fine-Tuning https://arxiv.org/abs/2210.05643 ☆75 Updated last year
- ☆11 Updated 3 years ago
- ☆9 Updated last year
- Model zoo for different kinds of uncertainty quantification methods used in Natural Language Processing, implemented in PyTorch. ☆53 Updated 2 years ago
- Code for the paper "Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias" ☆78 Updated 3 years ago
- Code for "Tracing Knowledge in Language Models Back to the Training Data" ☆38 Updated 2 years ago
- ☆44 Updated last year
- ☆25 Updated 7 months ago
- Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments (Zhou et al., EMNLP 2024) ☆13 Updated 8 months ago
- Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery" ☆35 Updated last year
- [NeurIPS'23] Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors ☆77 Updated 6 months ago
- ☆30 Updated 11 months ago
- Restore safety in fine-tuned language models through task arithmetic ☆28 Updated last year
- ☆29 Updated last year
- A framework to train language models to learn invariant representations. ☆14 Updated 3 years ago
- Align your LM to express calibrated verbal statements of confidence in its long-form generations. ☆26 Updated last year
- ☆18 Updated last year
- In-context Example Selection with Influences ☆15 Updated 2 years ago
- [NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering ☆60 Updated 7 months ago
- Landing page for MIB: A Mechanistic Interpretability Benchmark ☆12 Updated last week
- EMNLP 2022: "MABEL: Attenuating Gender Bias using Textual Entailment Data" https://arxiv.org/abs/2210.14975 ☆38 Updated last year
- Explaining neural decisions contrastively to alternative decisions. ☆25 Updated 4 years ago
- Discretized Integrated Gradients for Explaining Language Models (EMNLP 2021) ☆27 Updated 3 years ago