safety-research / assistant-axis

The Assistant Axis is a direction in activation space that captures how "Assistant-like" a model's behavior is. Models can drift away from the Assistant during conversations—sometimes toward bizarre or harmful personas. This repo contains a pipeline for generating the Assistant Axis and notebooks for monitoring and steering with it.
69 | Jan 20, 2026 | Updated 3 weeks ago
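
To make the idea of "a direction in activation space" concrete, here is a minimal sketch of monitoring and steering along a given axis. It is not the repository's API: the function names `unit`, `project`, and `steer` are hypothetical, activations are plain Python lists rather than model tensors, and the axis is taken as given rather than produced by the repo's pipeline. Monitoring is read here as a scalar projection onto the unit axis, and steering as adding a multiple of that unit axis, which is one common way such directions are used.

```python
import math


def unit(vector):
    """Return the vector scaled to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("axis must be non-zero")
    return [x / norm for x in vector]


def project(activation, axis):
    """Monitoring: scalar projection of an activation onto the axis."""
    u = unit(axis)
    return sum(a * b for a, b in zip(activation, u))


def steer(activation, axis, alpha):
    """Steering: shift the activation by alpha along the unit axis."""
    u = unit(axis)
    return [a + alpha * b for a, b in zip(activation, u)]


# Toy 2-D example; real activations would have thousands of dimensions.
axis = [3.0, 4.0]          # assumed axis, purely illustrative
activation = [1.0, 2.0]    # assumed activation, purely illustrative

before = project(activation, axis)
after = project(steer(activation, axis, 1.5), axis)
print(before, after)
```

Under these assumptions, steering by `alpha` raises the projection by exactly `alpha`, so the same scalar serves both as the monitored quantity and as the steering knob.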

Alternatives and similar repositories for assistant-axis

Users who are interested in assistant-axis are comparing it to the libraries listed below.
