stanfordnlp/axbench

Stanford NLP Python library for benchmarking the utility of LLM interpretability methods

☆210

Alternatives and similar repositories for axbench

Users who are interested in axbench are comparing it to the libraries listed below.

explanare / ravel
Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
☆58 · Oct 30, 2025 · Updated 8 months ago
stanfordnlp / pyvene
Stanford NLP Python library for understanding and improving PyTorch models via interventions
☆893 · Mar 6, 2026 · Updated 4 months ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆241 · May 23, 2024 · Updated 2 years ago
TransluceAI / circuits
ADAG: Transluce's MLP neuron-level circuit tracing library
☆34 · Apr 10, 2026 · Updated 3 months ago
lone17 / angular-steering
[WIP] [NeurIPS 2025 Spotlight] Angular Steering: Behavior Control via Rotation in Activation Space
☆25 · May 25, 2026 · Updated 2 months ago
ndif-team / nnsight
The nnsight package enables interpreting and manipulating the internals of deep learned models.
☆998 · Updated this week
adamkarvonen / SAEBench
☆178 · May 1, 2026 · Updated 2 months ago
VITA-Group / SEAL
[COLM 2025] SEAL: Steerable Reasoning Calibration of Large Language Models for Free
☆60 · Apr 6, 2025 · Updated last year
decoderesearch / SAELens
Training Sparse Autoencoders on Language Models
☆1,484 · Updated this week
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆181 · Sep 18, 2025 · Updated 10 months ago
ZFancy / awesome-activation-engineering
A curated list of resources for activation engineering
☆140 · Oct 2, 2025 · Updated 9 months ago
saprmarks / dictionary_learning
☆428 · Aug 21, 2025 · Updated 11 months ago
cvenhoff / steering-thinking-llms
☆39 · Jul 9, 2025 · Updated last year
noanabeshima / matryoshka-saes
☆33 · Nov 28, 2024 · Updated last year
saprmarks / feature-circuits
☆223 · Oct 14, 2025 · Updated 9 months ago
NLie2 / what_features_jailbreak_LLMs
☆18 · Mar 30, 2025 · Updated last year
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆268 · Updated this week
matchten / LoRA-Models-for-SAEs
Code for reproducing our paper "Low Rank Adapting Models for Sparse Autoencoder Features"
☆17 · Mar 31, 2025 · Updated last year
wbopan / safety-residual-space
Multi-dimensional analysis of orthogonal safety directions in LLM alignment
☆23 · Jun 12, 2026 · Updated last month
safety-research / persona_vectors
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
☆452 · Apr 22, 2026 · Updated 3 months ago
andyrdt / refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆424 · Jun 13, 2025 · Updated last year
rhubarbwu / linguistic-collapse
Codebase for Linguistic Collapse: Neural Collapse in (Large) Language Models [NeurIPS 2024] [arXiv:2405.17767]
☆19 · Apr 14, 2025 · Updated last year
stanfordnlp / pyreft
Stanford NLP Python library for Representation Finetuning (ReFT)
☆1,574 · Mar 5, 2026 · Updated 4 months ago
jacobdunefsky / transcoder_circuits
☆211 · Nov 17, 2024 · Updated last year
goodfire-ai / causalab
☆104 · Jul 15, 2026 · Updated last week
tilde-research / activault
Engine for collecting, uploading, and downloading model activations
☆30 · Apr 2, 2025 · Updated last year
science-of-finetuning / diffing-toolkit
A toolkit that provides a range of model diffing techniques including a UI to visualize them interactively.
☆78 · Updated this week
steering-vectors / steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
☆157 · Feb 21, 2025 · Updated last year
ZJU-REAL / EasySteer
A Unified Framework for High-Performance and Extensible LLM Steering
☆288 · Apr 30, 2026 · Updated 2 months ago
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆266 · Sep 24, 2024 · Updated last year
TransformerLensOrg / TransformerLens
A library for mechanistic interpretability of GPT-style language models
☆3,712 · Updated this week
duykhuongnguyen / MAT-Steer
☆21 · Aug 19, 2025 · Updated 11 months ago
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆99 · Dec 31, 2025 · Updated 6 months ago
TransluceAI / observatory
A toolkit for describing model features and intervening on those features to steer behavior.
☆249 · Mar 16, 2026 · Updated 4 months ago
Trustworthy-ML-Lab / ThinkEdit
[EMNLP 25] An effective and interpretable weight-editing method for mitigating overly short reasoning in LLMs, and a mechanistic study un…
☆19 · Dec 17, 2025 · Updated 7 months ago
aypan17 / latentqa
☆34 · Nov 16, 2025 · Updated 8 months ago
shangshang-wang / Resa
Resa: Transparent Reasoning Models via SAEs
☆50 · Sep 23, 2025 · Updated 10 months ago
ApolloResearch / deception-detection
☆44 · Feb 11, 2025 · Updated last year
swei2001 / RouteSAEs
☆15 · Jan 2, 2026 · Updated 6 months ago