IBM/sae-steering

Code to enable layer-level steering in LLMs using sparse autoencoders

☆34

Alternatives and similar repositories for sae-steering

Users interested in sae-steering are comparing it to the libraries listed below.

PKU-Alignment / SAE-V
[ICML 2025 Poster] SAE-V: Interpreting Multimodal Models for Enhanced Alignment
☆17 · Jun 5, 2025 · Updated last year
DanielSc4 / Dynamic-Activation-Composition
Materials for "Multi-property Steering of Large Language Models with Dynamic Activation Composition"
☆14 · Nov 22, 2024 · Updated last year
yuzhaouoe / SAE-based-representation-engineering
[NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
☆83 · Jun 20, 2026 · Updated last month
mishajw / repeng
Experiments with representation engineering
☆14 · Feb 28, 2024 · Updated 2 years ago
noanabeshima / tinymodel
A TinyStories LM with SAEs and transcoders
☆14 · Apr 3, 2025 · Updated last year
koayon / atp_star
PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind)
☆20 · Jan 19, 2025 · Updated last year
lyh6560new / P3Sum
The official code for the paper "What Constitutes a Faithful Summary? Preserving Author Perspectives in News Summarization"
☆10 · Jun 23, 2024 · Updated 2 years ago
cywinski / SAeUron
[ICML 2025] Unlearning in Diffusion Models using Sparse Autoencoders
☆62 · Oct 16, 2025 · Updated 9 months ago
kxcloud / gradient-routing
☆11 · Dec 4, 2024 · Updated last year
Aaquib111 / edge-attribution-patching
Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery"
☆48 · May 31, 2024 · Updated 2 years ago
ZFancy / awesome-activation-engineering
A curated list of resources for activation engineering
☆140 · Oct 2, 2025 · Updated 9 months ago
OpenMOSS / Llamascopium
Performant framework for training, analyzing and visualizing Sparse Autoencoders (SAEs) and their frontier variants.
☆223 · Updated this week
EleutherAI / equivariance
A framework for implementing equivariant DL
☆10 · May 25, 2021 · Updated 5 years ago
swei2001 / RouteSAEs
☆15 · Jan 2, 2026 · Updated 6 months ago
bartbussmann / matryoshka_sae
☆72 · Jan 17, 2025 · Updated last year
coffee4j / coffee4j
A Java-based framework for combinatorial test input generation, fault characterization and automated test execution.
☆12 · Jan 22, 2024 · Updated 2 years ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆240 · May 23, 2024 · Updated 2 years ago
openai / sparse_autoencoder
☆595 · Jul 19, 2024 · Updated 2 years ago
Butanium / tiny-activation-dashboard
A tiny easily hackable implementation of a feature dashboard.
☆17 · Oct 21, 2025 · Updated 9 months ago
BatsResearch / cross-lingual-detox
Code for "Preference Tuning For Toxicity Mitigation Generalizes Across Languages." Paper accepted at Findings of EMNLP 2024
☆18 · Mar 25, 2025 · Updated last year
decoderesearch / SAELens
Training Sparse Autoencoders on Language Models
☆1,477 · Updated this week
exoji2e / hashcode-template
☆16 · Feb 24, 2022 · Updated 4 years ago
a-r-r-o-w / productionizing-diffusion
Optimizing diffusion for production-ready speeds
☆40 · Jan 10, 2026 · Updated 6 months ago
cohere-ai / tokenizer
BPE tokenization implemented in Golang 💙
☆11 · Oct 2, 2023 · Updated 2 years ago
darrow-labs / LegalLens
☆10 · Jul 15, 2024 · Updated 2 years ago
cywinski / eliciting-secret-knowledge
Code repository for "Eliciting Secret Knowledge from Language Models"
☆23 · Mar 30, 2026 · Updated 3 months ago
UKGovernmentBEIS / vllm-lens
Extract residual-stream activations and apply steering vectors (including activation oracles) to any vLLM model during inference.
☆117 · Updated this week
zepingyu0512 / awesome-LLM-neuron
☆36 · Jun 13, 2025 · Updated last year
microsoft / llm-steer-instruct
A method for steering llms to better follow instructions
☆96 · Jun 10, 2026 · Updated last month
MikaStars39 / FeatureAlignment
FeatureAlignment = Alignment + Mechanistic Interpretability
☆35 · Mar 8, 2025 · Updated last year
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆138 · Feb 24, 2025 · Updated last year
JasonGross / guarantees-based-mechanistic-interpretability
☆18 · Updated this week
alignedai / HappyFaces
The Happy Faces Benchmark
☆15 · Jul 20, 2023 · Updated 3 years ago
ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆58 · May 1, 2025 · Updated last year
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆265 · Feb 27, 2026 · Updated 4 months ago
SamuelGong / grad_attacks
Self-Teaching Notes on Gradient Leakage Attacks against GPT-2 models.
☆14 · Mar 18, 2024 · Updated 2 years ago
JoshEngels / SAE-Dark-Matter
Code for our paper "Decomposing The Dark Matter of Sparse Autoencoders"
☆23 · Feb 6, 2025 · Updated last year
kq-chen / qwen-vl-utils
helper functions for processing and integrating visual language information with Qwen-VL Series Model
☆17 · Aug 30, 2024 · Updated last year
princeton-nlp / Edge-Pruning
[NeurIPS 2024 Spotlight] Code and data for the paper "Finding Transformer Circuits with Edge Pruning".
☆70 · Aug 15, 2025 · Updated 11 months ago