rmovva / HypotheSAEs
Hypothesizing interpretable relationships in text datasets using sparse autoencoders.
☆64 Updated 3 weeks ago
Alternatives and similar repositories for HypotheSAEs
Users who are interested in HypotheSAEs are comparing it to the libraries listed below.
- PAIR.withgoogle.com and friends' work on interpretability methods ☆214 Updated this week
- ☆54 Updated 5 months ago
- ☆136 Updated this week
- ☆111 Updated 9 months ago
- Course Materials for Interpretability of Large Language Models (0368.4264) at Tel Aviv University ☆110 Updated this week
- A lightweight library for Bayesian analysis of LLM evals (ICML 2025 Spotlight Position Paper) ☆21 Updated 5 months ago
- Discovering Data-driven Hypotheses in the Wild ☆118 Updated 5 months ago
- The Prism Alignment Project ☆86 Updated last year
- Attribution-based Parameter Decomposition ☆31 Updated 5 months ago
- Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models … ☆225 Updated last week
- This is the official repository for HypoGeniC (Hypothesis Generation in Context) and HypoRefine, which are automated, data-driven tools t… ☆91 Updated last week
- CEBaB: Estimating the Causal Effects of Real-World Concepts on NLP Model Behavior ☆12 Updated 3 years ago
- Forecasting with LLMs ☆55 Updated last year
- A toolkit for describing model features and intervening on those features to steer behavior. ☆214 Updated last year
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks ☆50 Updated 11 months ago
- Evaluate uncertainty, calibration, accuracy, and fairness of LLMs on real-world survey data! ☆25 Updated 7 months ago
- We develop benchmarks and analysis tools to evaluate the causal reasoning abilities of LLMs. ☆132 Updated last year
- Materials for the course Principles of AI: LLMs at UPenn (Stat 9911, Spring 2025). LLM architectures, training paradigms (pre- and post-t… ☆43 Updated 5 months ago
- ☆256 Updated 7 months ago
- Sparse Autoencoder for Mechanistic Interpretability ☆284 Updated last year
- Steering vectors for transformer language models in Pytorch / Huggingface ☆129 Updated 9 months ago
- Landing page for MIB: A Mechanistic Interpretability Benchmark ☆21 Updated 3 months ago
- ☆61 Updated last month
- Evaluate interpretability methods on localizing and disentangling concepts in LLMs. ☆56 Updated 3 weeks ago
- ☆79 Updated last month
- Data and code for the Corr2Cause paper (ICLR 2024) ☆111 Updated last year
- ☆245 Updated 2 months ago
- Probabilistic programming with large language models ☆143 Updated last week
- Open source interpretability artefacts for R1. ☆163 Updated 7 months ago
- ☆116 Updated last year