multimodal-interpretability / FIND
Official implementation of FIND (NeurIPS '23): Function Interpretation Benchmark and Automated Interpretability Agents
☆48 Updated 3 months ago
Alternatives and similar repositories for FIND:
Users who are interested in FIND are comparing it to the repositories listed below:
- Advantage Leftover Lunch Reinforcement Learning (A-LoL RL): Improving Language Models with Advantage-based Offline Policy Gradients ☆26 Updated 4 months ago
- Official implementation of MAIA, A Multimodal Automated Interpretability Agent ☆70 Updated 5 months ago
- Code for reproducing our paper "Not All Language Model Features Are Linear" ☆66 Updated last month
- 👻 Code and benchmark for our EMNLP 2023 paper - "FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions" ☆52 Updated 7 months ago
- ☆76 Updated 6 months ago
- Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions" ☆64 Updated 6 months ago
- ☆93 Updated 6 months ago
- This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity… ☆21 Updated 9 months ago
- ☆82 Updated 11 months ago
- Official implementation of the transformer (TF) architecture suggested in a paper entitled "Looped Transformers as Programmable Computers… ☆24 Updated last year
- Implementation of PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024) ☆31 Updated 2 months ago
- ☆43 Updated 5 months ago
- Official PyTorch implementation of "Neural Relation Graph: A Unified Framework for Identifying Label Noise and Outlier Data" (NeurIPS'23) ☆15 Updated last year
- ☆26 Updated 6 months ago
- Repo for: When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment ☆38 Updated last year
- Open source replication of Anthropic's Crosscoders for Model Diffing ☆28 Updated 2 months ago
- Implementation of Bitune: Bidirectional Instruction-Tuning ☆16 Updated 7 months ago
- ☆52 Updated last year
- Official Code Repository for EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents (COLM 2024) ☆27 Updated 6 months ago
- ☆69 Updated 5 months ago
- ☆12 Updated 10 months ago
- ☆44 Updated last year
- Q-Probe: A Lightweight Approach to Reward Maximization for Language Models ☆40 Updated 7 months ago
- ☆50 Updated 2 months ago
- Repository for the code of the "PPL-MCTS: Constrained Textual Generation Through Discriminator-Guided Decoding" paper, NAACL'22 ☆65 Updated 2 years ago
- A mechanistic approach for understanding and detecting factual errors of large language models. ☆39 Updated 6 months ago
- This is code for most of the experiments in the paper Understanding the Effects of RLHF on LLM Generalisation and Diversity ☆39 Updated 11 months ago
- Simple and scalable tools for data-driven pretraining data selection. ☆14 Updated this week
- Official implementation of Bootstrapping Language Models via DPO Implicit Rewards ☆41 Updated 5 months ago