multimodal-interpretability / FIND
Official implementation of FIND (NeurIPS '23) Function Interpretation Benchmark and Automated Interpretability Agents
☆49 Updated 8 months ago
Alternatives and similar repositories for FIND
Users who are interested in FIND are comparing it to the libraries listed below.
- Advantage Leftover Lunch Reinforcement Learning (A-LoL RL): Improving Language Models with Advantage-based Offline Policy Gradients ☆26 Updated 8 months ago
- 👻 Code and benchmark for our EMNLP 2023 paper - "FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions" ☆55 Updated last year
- Sparse and discrete interpretability tool for neural networks ☆63 Updated last year
- ☆97 Updated 11 months ago
- Code for reproducing our paper "Not All Language Model Features Are Linear" ☆75 Updated 6 months ago
- ☆84 Updated 10 months ago
- ☆23 Updated 4 months ago
- Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount… ☆54 Updated last year
- Self-Supervised Alignment with Mutual Information ☆19 Updated last year
- Online Adaptation of Language Models with a Memory of Amortized Contexts (NeurIPS 2024) ☆63 Updated 10 months ago
- Q-Probe: A Lightweight Approach to Reward Maximization for Language Models ☆41 Updated 11 months ago
- Official implementation of MAIA, A Multimodal Automated Interpretability Agent ☆81 Updated 3 months ago
- ☆93 Updated 11 months ago
- Reinforcement Learning via Regressing Relative Rewards ☆33 Updated 5 months ago
- A mechanistic approach for understanding and detecting factual errors of large language models. ☆46 Updated 11 months ago
- This is code for most of the experiments in the paper Understanding the Effects of RLHF on LLM Generalisation and Diversity ☆43 Updated last year
- [EMNLP Findings 2024 & ACL 2024 NLRSE Oral] Enhancing Mathematical Reasonin… ☆51 Updated last year
- PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024) ☆35 Updated 7 months ago
- ☆21 Updated 8 months ago
- Repository for NPHardEval, a quantified-dynamic benchmark of LLMs ☆54 Updated last year
- Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions" ☆70 Updated 11 months ago
- ☆35 Updated 2 years ago
- ☆26 Updated 2 years ago
- ☆32 Updated 4 months ago
- Repository for the code of the "PPL-MCTS: Constrained Textual Generation Through Discriminator-Guided Decoding" paper, NAACL'22 ☆66 Updated 2 years ago
- Language models scale reliably with over-training and on downstream tasks ☆97 Updated last year
- ☆54 Updated last year
- ☆79 Updated 9 months ago
- Code for LaMPP: Language Models as Probabilistic Priors for Perception and Action ☆37 Updated 2 years ago
- RL algorithm: Advantage induced policy alignment ☆65 Updated last year