multimodal-interpretability / FIND
Official implementation of FIND (NeurIPS '23): Function Interpretation Benchmark and Automated Interpretability Agents
☆45 Updated last month
Related projects
Alternatives and complementary repositories for FIND
- Official implementation of MAIA, A Multimodal Automated Interpretability Agent ☆62 Updated 2 months ago
- ☆75 Updated 9 months ago
- Advantage Leftover Lunch Reinforcement Learning (A-LoL RL): Improving Language Models with Advantage-based Offline Policy Gradients ☆24 Updated last month
- Code for reproducing our paper "Not All Language Model Features Are Linear" ☆60 Updated last month
- ☆89 Updated 4 months ago
- ☆73 Updated 4 months ago
- Language models scale reliably with over-training and on downstream tasks ☆94 Updated 7 months ago
- ☆44 Updated last year
- Directional Preference Alignment ☆49 Updated last month
- ☆99 Updated this week
- Online Adaptation of Language Models with a Memory of Amortized Contexts (NeurIPS 2024) ☆53 Updated 3 months ago
- Sparse and discrete interpretability tool for neural networks ☆53 Updated 8 months ago
- A mechanistic approach for understanding and detecting factual errors of large language models. ☆39 Updated 4 months ago
- ☆50 Updated last year
- Function Vectors in Large Language Models (ICLR 2024) ☆116 Updated 3 weeks ago
- Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions" ☆61 Updated 4 months ago
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision ☆95 Updated 2 months ago
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" ☆83 Updated 7 months ago
- ☆65 Updated 7 months ago
- This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity… ☆17 Updated 7 months ago
- Repository for the code of the "PPL-MCTS: Constrained Textual Generation Through Discriminator-Guided Decoding" paper, NAACL'22 ☆64 Updated 2 years ago
- ☆50 Updated last month
- ☆102 Updated last month
- The official implementation of Self-Exploring Language Models (SELM) ☆56 Updated 5 months ago
- Repository for the paper Stream of Search: Learning to Search in Language ☆84 Updated 2 months ago
- This repository includes code for the paper "Does Localization Inform Editing? Surprising Differences in Where Knowledge Is Stored vs. Ca… ☆55 Updated last year
- Official implementation of Bootstrapping Language Models via DPO Implicit Rewards ☆39 Updated 3 months ago
- ☆23 Updated 3 months ago
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆25 Updated 5 months ago