adamkarvonen/activation_oracles

adamkarvonen / activation_oracles

☆96

Alternatives and similar repositories for activation_oracles

Users who are interested in activation_oracles are comparing it to the libraries listed below.

tim-hua-01 / steering-eval-awareness-public
☆17 · Mar 16, 2026 · Updated 4 months ago

science-of-finetuning / diffing-toolkit
A toolkit that provides a range of model diffing techniques including a UI to visualize them interactively.
☆78 · Jul 20, 2026 · Updated last week

clarifying-EM / model-organisms-for-EM
Code repo for the model organisms and convergent directions of EM papers.
☆72 · Sep 22, 2025 · Updated 10 months ago

safety-research / false-facts
☆50 · Jul 4, 2025 · Updated last year

safety-research / introspection-adapters
Training LLMs to Report Their Learned Behaviors
☆27 · Apr 28, 2026 · Updated 3 months ago

adamkarvonen / SAEBench
☆179 · May 1, 2026 · Updated 2 months ago

jacobdunefsky / transcoder_circuits
☆212 · Nov 17, 2024 · Updated last year

safety-research / assistant-axis
The Assistant Axis is a direction in activation space that captures how "Assistant-like" a model's behavior is. Models can drift away fro…
☆159 · Jan 20, 2026 · Updated 6 months ago

decoderesearch / SAELens
Training Sparse Autoencoders on Language Models
☆1,487 · Updated this week

kitft / nla-inference
☆37 · May 7, 2026 · Updated 2 months ago

ajobi-uhc / seer
This was designed for interp researchers who want to do research on or with interp agents to give quality of life improvements and fix …
☆146 · Feb 8, 2026 · Updated 5 months ago

ariahw / rl-rewardhacking
☆44 · Feb 18, 2026 · Updated 5 months ago

anthropics / sycophancy-to-subterfuge-paper
☆28 · Sep 5, 2024 · Updated last year

jacobdunefsky / llm-steering-opt
Tools for optimizing steering vectors in LLMs.
☆22 · Apr 10, 2025 · Updated last year

alif-munim / autosae
Adapting Karpathy's autoresearch to train SAEs on language models.
☆19 · Mar 12, 2026 · Updated 4 months ago

ndif-team / nnterp
Unified access to Large Language Model modules using NNsight
☆116 · Updated this week

dreadnode / agent-lens
Agent observability and replay tooling for AI safety & interpretability research.
☆109 · Jun 19, 2026 · Updated last month

callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆268 · Feb 27, 2026 · Updated 5 months ago

interp-reasoning / thought-anchors
⚓️ Repository for the "Thought Anchors: Which LLM Reasoning Steps Matter?" paper.
☆137 · Oct 27, 2025 · Updated 9 months ago

zepingyu0512 / arithmetic-mechanism
Code for EMNLP 2024 paper: Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis
☆12 · Nov 17, 2024 · Updated last year

keing1 / reward-hack-generalization
Datasets used in the paper "Reward hacking behavior can generalize across tasks"
☆15 · Aug 17, 2025 · Updated 11 months ago

adamkarvonen / dictionary_learning_demo
☆26 · Aug 23, 2025 · Updated 11 months ago

TransformerLensOrg / TransformerLens
A library for mechanistic interpretability of GPT-style language models
☆3,723 · Updated this week

goodfire-ai / memorization_kfac
☆29 · Nov 6, 2025 · Updated 8 months ago

LukeBailey181 / obfuscated-activations
Codebase for Obfuscated Activations Bypass LLM Latent-Space Defenses
☆31 · Feb 11, 2025 · Updated last year

emergent-misalignment / emergent-misalignment
☆317 · Jan 12, 2026 · Updated 6 months ago

Jiaxin-Wen / MisleadLM
Official Code for our paper: "Language Models Learn to Mislead Humans via RLHF"
☆20 · Oct 11, 2024 · Updated last year

jbloomAus / SAEDashboard
☆109 · May 23, 2026 · Updated 2 months ago

amirgamil / curius-search
Search engine of my Curius data
☆16 · Apr 10, 2022 · Updated 4 years ago

saprmarks / dictionary_learning
☆428 · Aug 21, 2025 · Updated 11 months ago

init-22 / CartPole-v0-using-Q-learning-SARSA-and-DNN
Solution to Cartpole balancing problem with the help of reinforcement learning and Deep Neural Networks.
☆11 · May 5, 2023 · Updated 3 years ago

cvenhoff / steering-thinking-llms
☆39 · Jul 9, 2025 · Updated last year

markusstrasser / pitchplease
No buffers, no delay, no machine learning. Just instant polyphonic pitch detection
☆21 · Nov 30, 2025 · Updated 7 months ago

TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆358 · Apr 30, 2026 · Updated 2 months ago

EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆269 · Updated this week

FarnoushRJ / RelP
[NeurIPS 2025 MechInterp Workshop - Spotlight] Official implementation of the paper "RelP: Faithful and Efficient Circuit Discovery in La…
☆29 · Nov 3, 2025 · Updated 8 months ago

ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆58 · May 1, 2025 · Updated last year

g-luo / generative_latent_prior
Official PyTorch Implementation for Learning a Generative Meta-Model of LLM Activations, ICML 2026
☆90 · Apr 30, 2026 · Updated 2 months ago

ndif-team / nnsight
The nnsight package enables interpreting and manipulating the internals of deep learned models.
☆1,005 · Updated this week