rmovva / HypotheSAEs

Hypothesizing interpretable relationships in text datasets using sparse autoencoders.

☆64

Alternatives and similar repositories for HypotheSAEs

Users who are interested in HypotheSAEs are comparing it to the libraries listed below.


PAIR-code / interpretability
PAIR.withgoogle.com and friends' work on interpretability methods
☆214 Updated this week
Weixin-Liang / Mapping-the-Increasing-Use-of-LLMs-in-Scientific-Papers
☆54 Updated 5 months ago
adamkarvonen / SAEBench
☆136 Updated this week
KihoPark / LLM_Categorical_Hierarchical_Representations
☆111 Updated 9 months ago
mega002 / llm-interp-tau
Course Materials for Interpretability of Large Language Models (0368.4264) at Tel Aviv University
☆110 Updated this week
sambowyer / bayes_evals
A lightweight library for Bayesian analysis of LLM evals (ICML 2025 Spotlight Position Paper)
☆21 Updated 5 months ago
allenai / discoverybench
Discovering Data-driven Hypotheses in the Wild
☆118 Updated 5 months ago
HannahKirk / prism-alignment
The Prism Alignment Project
☆86 Updated last year
ApolloResearch / apd
Attribution-based Parameter Decomposition
☆31 Updated 5 months ago
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆225 Updated last week
ChicagoHAI / hypothesis-generation
This is the official repository for HypoGeniC (Hypothesis Generation in Context) and HypoRefine, which are automated, data-driven tools t…
☆91 Updated last week
CEBaBing / CEBaB
CEBaB: Estimating the Causal Effects of Real-World Concepts on NLP Model Behavior
☆12 Updated 3 years ago
dannyallover / llm_forecasting
Forecasting with LLMs
☆55 Updated last year
TransluceAI / observatory
A toolkit for describing model features and intervening on those features to steer behavior.
☆214 Updated last year
aryamanarora / causalgym
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
☆50 Updated 11 months ago
socialfoundations / folktexts
Evaluate uncertainty, calibration, accuracy, and fairness of LLMs on real-world survey data!
☆25 Updated 7 months ago
causalNLP / cladder
We develop benchmarks and analysis tools to evaluate the causal reasoning abilities of LLMs.
☆132 Updated last year
dobriban / Principles-of-AI-LLMs
Materials for the course Principles of AI: LLMs at UPenn (Stat 9911, Spring 2025). LLM architectures, training paradigms (pre- and post-t…
☆43 Updated 5 months ago
Data-Provenance-Initiative / Data-Provenance-Collection
☆256 Updated 7 months ago
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆284 Updated last year
steering-vectors / steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
☆129 Updated 9 months ago
aaronmueller / MIB
Landing page for MIB: A Mechanistic Interpretability Benchmark
☆21 Updated 3 months ago
TransluceAI / docent
☆61 Updated last month
explanare / ravel
Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
☆56 Updated 3 weeks ago
jbloomAus / SAEDashboard
☆79 Updated last month
causalNLP / corr2cause
Data and code for the Corr2Cause paper (ICLR 2024)
☆111 Updated last year
rjha18 / vec2vec
☆245 Updated 2 months ago
genlm / llamppl
Probabilistic programming with large language models
☆143 Updated last week
goodfire-ai / r1-interpretability
Open source interpretability artefacts for R1.
☆163 Updated 7 months ago
tatsu-lab / opinions_qa
☆116 Updated last year