adamkarvonen / SAE_BoardGameEval

☆23

Alternatives and similar repositories for SAE_BoardGameEval

Users who are interested in SAE_BoardGameEval are comparing it to the libraries listed below.

wesg52 / universal-neurons
Universal Neurons in GPT2 Language Models
☆31 Updated last year
ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆55 Updated 7 months ago
jiahai-feng / binding-iclr
☆16 Updated last year
JoshEngels / MultiDimensionalFeatures
Code for reproducing our paper "Not All Language Model Features Are Linear"
☆84 Updated last year
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆80 Updated 4 months ago
mcleish7 / gemstone-scaling-laws
Gemstones: A Model Suite for Multi-Faceted Scaling Laws (NeurIPS 2025)
☆30 Updated 2 months ago
KihoPark / linear_rep_geometry
☆110 Updated 9 months ago
Nix07 / finetuning
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity…
☆28 Updated last month
taufeeque9 / codebook-features
Sparse and discrete interpretability tool for neural networks
☆64 Updated last year
DeqingFu / transformers-icl-second-order
Official repository for our paper, Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Mode…
☆20 Updated last year
ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆62 Updated last year
noanabeshima / matryoshka-saes
☆25 Updated last year
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆28 Updated last year
katiekang1998 / reasoning_generalization
☆33 Updated 10 months ago
epfml / schedules-and-scaling
Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations"
☆85 Updated last year
mlfoundations / scaling
Language models scale reliably with over-training and on downstream tasks
☆100 Updated last year
vedantpalit / Towards-Vision-Language-Mechanistic-Interpretability
This is the official repository for the "Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP" paper acce…
☆24 Updated last year
MadryLab / DsDm
☆51 Updated last year
saprmarks / geometry-of-truth
☆95 Updated last year
formll / resolving-scaling-law-discrepancies
☆20 Updated last month
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆150 Updated 5 months ago
berlino / seq_icl
☆53 Updated last year
yidingjiang / ado
The repository contains code for Adaptive Data Optimization
☆28 Updated 11 months ago
tim-lawson / mlsae
Multi-Layer Sparse Autoencoders (ICLR 2025)
☆27 Updated 9 months ago
clarifying-EM / model-organisms-for-EM
Code repo for the model organisms and convergent directions of EM papers.
☆40 Updated 2 months ago
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆71 Updated last year
lingo-mit / lm-truthfulness
☆17 Updated last year
RobertCsordas / moeut
☆89 Updated last year
ckkissane / sae-transfer
Code to reproduce key results accompanying "SAEs (usually) Transfer Between Base and Chat Models"
☆13 Updated last year
gregorbachmann / Next-Token-Failures
☆106 Updated last year