safety-research/assistant-axis

The Assistant Axis is a direction in activation space that captures how "Assistant-like" a model's behavior is. Models can drift away from the Assistant during conversations—sometimes toward bizarre or harmful personas. This repo contains a pipeline for generating the Assistant Axis and notebooks for monitoring and steering with it.
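To make "a direction in activation space" concrete, here is a minimal sketch on toy data. It is not the repository's pipeline: the difference-of-means construction, the helper names (`unit`, `mean`, `projection`, `steer`), and the 3-dimensional vectors are all assumptions chosen for illustration. It shows monitoring as a dot product with a unit direction and steering as adding a multiple of that direction.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def projection(activation, axis):
    """Monitoring: scalar position of an activation along the axis."""
    return sum(a * d for a, d in zip(activation, axis))

def steer(activation, axis, alpha):
    """Steering: shift an activation by alpha along the axis."""
    return [a + alpha * d for a, d in zip(activation, axis)]

# Toy activations (made up): "Assistant-like" vs. other-persona samples.
assistant_acts = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
other_acts = [[-1.0, 0.0, 0.0], [-3.0, 0.0, 0.0]]

# Assumed construction: normalized difference of the two group means.
diff = [a - b for a, b in zip(mean(assistant_acts), mean(other_acts))]
axis = unit(diff)

drifted = [-0.5, 2.0, 0.0]             # hypothetical drifted activation
before = projection(drifted, axis)     # negative: away from the Assistant side
after = projection(steer(drifted, axis, 1.5), axis)
print(before, after)
```

In a real model the vectors would be residual-stream activations at a chosen layer, and the choice of layer and steering strength would need to be tuned empirically.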

☆158

Alternatives and similar repositories for assistant-axis

Users who are interested in assistant-axis are comparing it to the repositories listed below.

safety-research / persona_vectors
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
☆452 · Apr 22, 2026 · Updated 3 months ago
adamkarvonen / activation_oracles
☆95 · Apr 18, 2026 · Updated 3 months ago
Jiaxin-Wen / MisleadLM
Official Code for our paper: "Language Models Learn to Mislead Humans via RLHF"
☆20 · Oct 11, 2024 · Updated last year
TransluceAI / introspective-interp
Repository for "Training Language Models To Explain Their Own Computations"
☆23 · Jul 7, 2026 · Updated 2 weeks ago
safety-research / safety-tooling
Inference API for many LLMs and other useful tools for empirical research
☆134 · May 29, 2026 · Updated last month
safety-research / introspection-adapters
Training LLMs to Report Their Learned Behaviors
☆27 · Apr 28, 2026 · Updated 2 months ago
UKGovernmentBEIS / vllm-lens
Extract residual-stream activations and apply steering vectors (including activation oracles) to any vLLM model during inference.
☆117 · Updated this week
clarifying-EM / model-organisms-for-EM
Code repo for the model organisms and convergent directions of EM papers.
☆72 · Sep 22, 2025 · Updated 10 months ago
ajobi-uhc / seer
This was designed for interp researchers who want to do research on or with interp agents to give quality of life improvements and fix …
☆146 · Feb 8, 2026 · Updated 5 months ago
safety-research / false-facts
☆50 · Jul 4, 2025 · Updated last year
UKPLab / tmlr2026-manifold-analysis
☆21 · Mar 3, 2026 · Updated 4 months ago
safety-research / bloom
bloom - evaluate any behavior immediately 🌸🌱
☆1,371 · May 7, 2026 · Updated 2 months ago
science-of-finetuning / diffing-toolkit
A toolkit that provides a range of model diffing techniques including a UI to visualize them interactively.
☆78 · Updated this week
ndif-team / nnterp
Unified access to Large Language Model modules using NNsight
☆116 · Jul 2, 2026 · Updated 3 weeks ago
emergent-misalignment / emergent-misalignment
☆314 · Jan 12, 2026 · Updated 6 months ago
safety-research / how-ai-impacts-skill-formation
Repo for measuring whether using AI tools inhibits skill formation and development
☆15 · Jan 3, 2026 · Updated 6 months ago
cadentj / caft
☆25 · Mar 30, 2026 · Updated 3 months ago
EleutherAI / attribute
☆16 · Nov 14, 2025 · Updated 8 months ago
ASTRAL-Group / MonitorBench
[COLM 2026] Official implementation for "MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Mo…
☆20 · Apr 23, 2026 · Updated 3 months ago
TruthfulAI-research / negation_neglect
Code for Negation Neglect
☆16 · May 22, 2026 · Updated 2 months ago
oclivegriffin / crosscode
A library for training crosscoders
☆17 · May 28, 2025 · Updated last year
openai / monitorability-evals
Open-sourced evaluation suite from the Monitoring Monitorability paper
☆88 · Jun 11, 2026 · Updated last month
meridianlabs-ai / inspect_petri
An alignment auditing agent capable of quickly exploring alignment hypotheses
☆1,270 · Updated this week
goodfire-ai / param-decomp
Parameter Decomposition
☆134 · Updated this week
ArthurConmy / MishformerLens
MishformerLens intends to be a drop-in replacement for TransformerLens that AST patches HuggingFace Transformers rather than implementing…
☆10 · Oct 7, 2024 · Updated last year
deemeetree / infranodus
A Node.js / Neo4j tool that translates words and relations into network graphs and shows you how it all connects.
☆13 · Oct 24, 2019 · Updated 6 years ago
keing1 / reward-hack-generalization
Datasets used in the paper "Reward hacking behavior can generalize across tasks"
☆15 · Aug 17, 2025 · Updated 11 months ago
interp-reasoning / thought-anchors
⚓️ Repository for the "Thought Anchors: Which LLM Reasoning Steps Matter?" paper.
☆137 · Oct 27, 2025 · Updated 8 months ago
decoderesearch / circuit-tracer
☆2,874 · Jul 18, 2026 · Updated last week
hijohnnylin / neuronpedia
open source interpretability platform 🧠
☆1,081 · Jul 17, 2026 · Updated last week
decoderesearch / SAELens
Training Sparse Autoencoders on Language Models
☆1,484 · Updated this week
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆210 · Mar 12, 2026 · Updated 4 months ago
saprmarks / geometry-of-truth
☆114 · Aug 8, 2024 · Updated last year
damek / specgd
Code to generate figures of paper "When do spectral gradient updates help in deep learning?"
☆16 · Dec 3, 2025 · Updated 7 months ago
MinhxLe / subliminal-learning
☆152 · Feb 10, 2026 · Updated 5 months ago
callummcdougall / ARENA_3.0
☆1,187 · Updated this week
ARBORproject / arborproject.github.io
☆86 · Feb 25, 2025 · Updated last year
Brett-Kennedy / ikNN
An interpretable kNN based on aggregating the predictions of multiple 2d spaces.
☆13 · Oct 21, 2024 · Updated last year
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆181 · Sep 18, 2025 · Updated 10 months ago