multimodal-interpretability / FIND

Official implementation of FIND (NeurIPS '23): Function Interpretation Benchmark and Automated Interpretability Agents

☆49

Alternatives and similar repositories for FIND

Users that are interested in FIND are comparing it to the libraries listed below


taufeeque9 / codebook-features
Sparse and discrete interpretability tool for neural networks
☆63 · Updated last year
architsharma97 / dpo-rlaif
☆99 · Updated last year
KihoPark / LLM_Categorical_Hierarchical_Representations
☆104 · Updated 5 months ago
skywalker023 / fantom
👻 Code and benchmark for our EMNLP 2023 paper - "FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions"
☆55 · Updated last year
KihoPark / linear_rep_geometry
☆103 · Updated 5 months ago
abhishekpanigrahi1996 / transformer_in_transformer
☆45 · Updated last year
feradauto / MoralCoT
Repo for: When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment
☆38 · Updated 2 years ago
guy-dar / embedding-space
☆54 · Updated 2 years ago
jihoontack / MAC
Online Adaptation of Language Models with a Memory of Amortized Contexts (NeurIPS 2024)
☆65 · Updated last year
princeton-nlp / TransformerPrograms
[NeurIPS 2023] Learning Transformer Programs
☆162 · Updated last year
justinlovelace / Diffusion-Guided-LM
☆27 · Updated 11 months ago
adamkarvonen / SAE_BoardGameEval
☆23 · Updated 6 months ago
microsoft / RLHF-APA
RL algorithm: Advantage-induced policy alignment
☆65 · Updated last year
JoshEngels / MultiDimensionalFeatures
Code for reproducing our paper "Not All Language Model Features Are Linear"
☆77 · Updated 8 months ago
gregorbachmann / Next-Token-Failures
☆89 · Updated last year
abdulhaim / LMRL-Gym
☆99 · Updated last year
google / belief-localization
This repository includes code for the paper "Does Localization Inform Editing? Surprising Differences in Where Knowledge Is Stored vs. Ca…
☆61 · Updated 2 years ago
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆72 · Updated last year
microsoft / mechanistic-error-probe
A mechanistic approach for understanding and detecting factual errors of large language models.
☆47 · Updated last year
facebookresearch / rlfh-gen-div
This is code for most of the experiments in the paper "Understanding the Effects of RLHF on LLM Generalisation and Diversity"
☆44 · Updated last year
vedantpalit / Towards-Vision-Language-Mechanistic-Interpretability
This is the official repository for the "Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP" paper acce…
☆22 · Updated last year
mlfoundations / scaling
Language models scale reliably with over-training and on downstream tasks
☆97 · Updated last year
jonhue / activeft
PyTorch library for Active Fine-Tuning
☆87 · Updated 5 months ago
likenneth / othello_world
Emergent world representations: Exploring a sequence model trained on a synthetic task
☆186 · Updated 2 years ago
ucl-dark / llm_debate
Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"
☆113 · Updated last year
ericwtodd / function_vectors
Function Vectors in Large Language Models (ICLR 2024)
☆175 · Updated 3 months ago
causalNLP / cladder
We develop benchmarks and analysis tools to evaluate the causal reasoning abilities of LLMs.
☆119 · Updated last year
NohTow / PPL-MCTS
Repository for the code of the "PPL-MCTS: Constrained Textual Generation Through Discriminator-Guided Decoding" paper, NAACL'22
☆66 · Updated 2 years ago
DeqingFu / transformers-icl-second-order
Official repository for our paper, Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Mode…
☆17 · Updated 8 months ago
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆112 · Updated last month