collin-burns/discovering_latent_knowledge

☆288

Alternatives and similar repositories for discovering_latent_knowledge

Users who are interested in discovering_latent_knowledge are comparing it to the repositories listed below.

EleutherAI / elk
Keeping language models honest by directly eliciting knowledge encoded in their activations.
☆221 · Updated this week
balevinstein / Probes
☆58 · Jun 30, 2023 · Updated 3 years ago
likenneth / honest_llama
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
☆581 · Jan 28, 2025 · Updated last year
AlignmentResearch / tuned-lens
Tools for understanding how transformer predictions are built layer-by-layer
☆605 · Aug 7, 2025 · Updated 11 months ago
EleutherAI / concept-erasure
Erasing concepts from neural representations with provable guarantees
☆258 · Jan 27, 2025 · Updated last year
andyzoujm / representation-engineering
Representation Engineering: A Top-Down Approach to AI Transparency
☆1,015 · Aug 14, 2024 · Updated last year
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆74 · Jun 19, 2024 · Updated 2 years ago
shauli-ravfogel / adv-kernel-removal
☆12 · Oct 23, 2022 · Updated 3 years ago
ejones313 / auditing-llms
☆61 · Mar 9, 2023 · Updated 3 years ago
kmeng01 / rome
Locating and editing factual associations in GPT (NeurIPS 2022)
☆770 · Apr 20, 2024 · Updated 2 years ago
nostalgebraist / transformer-utils
Utilities for the HuggingFace transformers library
☆77 · Jan 21, 2023 · Updated 3 years ago
YuejiangLIU / csl
Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts
☆15 · Feb 26, 2024 · Updated 2 years ago
KihoPark / linear_rep_geometry
Code for 'The Linear Representation Hypothesis and the Geometry of Large Language Models' (ICML 2024)
☆125 · Feb 11, 2025 · Updated last year
anthropics / sleeper-agents-paper
Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
☆150 · Mar 9, 2024 · Updated 2 years ago
openai / automated-interpretability
☆1,083 · Mar 6, 2024 · Updated 2 years ago
wrongu / repsim
PyTorch-based library for various kinds of representational-similarity analysis
☆26 · Jun 7, 2024 · Updated 2 years ago
google-deepmind / tracr
☆569 · Feb 5, 2024 · Updated 2 years ago
mega002 / ff-layers
The accompanying code for "Transformer Feed-Forward Layers Are Key-Value Memories". Mor Geva, Roei Schuster, Jonathan Berant, and Omer Le…
☆103 · Sep 5, 2021 · Updated 4 years ago
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆268 · Feb 27, 2026 · Updated 5 months ago
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆303 · Jul 20, 2024 · Updated 2 years ago
Teddy-Li / LLM-NLI-Analysis
☆15 · Jul 8, 2023 · Updated 3 years ago
PAIR-code / interpretability
PAIR.withgoogle.com and friend's work on interpretability methods
☆234 · Jun 22, 2026 · Updated last month
EleutherAI / pythia
The hub for EleutherAI's work on interpretability and learning dynamics
☆2,865 · Nov 15, 2025 · Updated 8 months ago
hongyanz / TRADES-smoothing
[JMLR] TRADES + random smoothing for certifiable robustness
☆14 · Sep 13, 2020 · Updated 5 years ago
ArthurConmy / Automatic-Circuit-Discovery
☆293 · Oct 1, 2024 · Updated last year
explanare / ravel
Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
☆58 · Oct 30, 2025 · Updated 8 months ago
aviclu / ffn-values
☆67 · May 18, 2023 · Updated 3 years ago
openai / sparse_autoencoder
☆597 · Jul 19, 2024 · Updated 2 years ago
dtch1997 / steering-bench
Official codebase for "Analyzing the Generalization and Reliability of Steering Vectors"
☆22 · Dec 14, 2024 · Updated last year
AlexTMallen / adaptive-retrieval
☆192 · Jul 2, 2025 · Updated last year
safety-research / inoculation-prompting
☆15 · Oct 13, 2025 · Updated 9 months ago
koayon / atp_star
PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind)
☆20 · Jan 19, 2025 · Updated last year
anthropics / evals
☆415 · Jul 2, 2024 · Updated 2 years ago
rudinger / defeasible-nli
Defeasible Natural Language Inference
☆14 · Dec 4, 2020 · Updated 5 years ago
TomFrederik / unseal
Mechanistic Interpretability for Transformer Models
☆53 · Jun 1, 2022 · Updated 4 years ago
JacobPfau / procgenAISC
☆20 · Jan 21, 2023 · Updated 3 years ago
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆105 · Sep 21, 2023 · Updated 2 years ago
saprmarks / geometry-of-truth
☆114 · Aug 8, 2024 · Updated last year
PKU-Alignment / aligner
[NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct
☆194 · Jan 16, 2025 · Updated last year