EleutherAI/steering-llama3

☆30

Alternatives and similar repositories for steering-llama3

Users that are interested in steering-llama3 are comparing it to the libraries listed below.

steering-vectors / steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
☆157 · Feb 21, 2025 · Updated last year
MaheepChaudhary / SAE-Ravel
Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…
☆13 · Jan 26, 2025 · Updated last year
ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆58 · May 1, 2025 · Updated last year
fiveai / understanding_safety_finetuning
Official Code for What Makes and Breaks Safety Fine-tuning? A Mechanistic Study (NeurIPS 2024)
☆12 · Oct 31, 2024 · Updated last year
Phylliida / MambaLens
Mamba support for transformer lens
☆20 · Sep 17, 2024 · Updated last year
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆267 · Feb 27, 2026 · Updated 4 months ago
slavachalnev / SAE-TS
Improving Steering Vectors by Targeting Sparse Autoencoder Features
☆29 · Nov 20, 2024 · Updated last year
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆90 · Mar 7, 2025 · Updated last year
jiahai-feng / binding-iclr
☆19 · Mar 5, 2024 · Updated 2 years ago
neelnanda-io / 1L-Sparse-Autoencoder
☆141 · Oct 28, 2023 · Updated 2 years ago
explanare / ravel
Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
☆58 · Oct 30, 2025 · Updated 8 months ago
am-bean / lingOly
A benchmark for language models based on the UK Linguistics Olympiad
☆12 · Mar 3, 2025 · Updated last year
vgel / repeng
A library for making RepE control vectors
☆744 · Sep 24, 2025 · Updated 10 months ago
saprmarks / feature-circuits
☆223 · Oct 14, 2025 · Updated 9 months ago
Nix07 / belief_tracking
This repository contains the code used for the experiments in the paper "Language Models use Lookbacks to Track Beliefs".
☆16 · Mar 14, 2026 · Updated 4 months ago
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆266 · Sep 24, 2024 · Updated last year
ejnnr / cupbearer
A library for mechanistic anomaly detection
☆22 · Jan 9, 2025 · Updated last year
chrisliu298 / awesome-representation-engineering
A resource repository for representation engineering in large language models
☆156 · Nov 14, 2024 · Updated last year
leopoldwhite / Awesome-Inference-Time-Trustworthiness
☆15 · May 15, 2026 · Updated 2 months ago
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆33 · May 23, 2024 · Updated 2 years ago
flowersteam / EAGER
☆10 · Oct 11, 2022 · Updated 3 years ago
EleutherAI / concept-erasure
Erasing concepts from neural representations with provable guarantees
☆258 · Jan 27, 2025 · Updated last year
aryamanarora / causalgym
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
☆54 · Nov 30, 2024 · Updated last year
tonychenxyz / selfie
This repository contains the code and data for the paper "SelfIE: Self-Interpretation of Large Language Model Embeddings" by Haozhe Chen,…
☆58 · Dec 9, 2024 · Updated last year
JoshEngels / MultiDimensionalFeatures
Code for reproducing our paper "Not All Language Model Features Are Linear"
☆90 · Nov 27, 2024 · Updated last year
annahdo / implementing_activation_steering
A collection of different ways to implement accessing and modifying internal model activations for LLMs
☆24 · Oct 18, 2024 · Updated last year
safety-research / believe-it-or-not
Code and data for editing model beliefs with SDF and other methods, and for evaluating the depth of the implanted beliefs.
☆16 · Oct 23, 2025 · Updated 9 months ago
aleks-krasowski / PINNfluence
☆17 · Jun 3, 2026 · Updated last month
serre-lab / Horama
☆19 · May 1, 2025 · Updated last year
saprmarks / geometry-of-truth
☆114 · Aug 8, 2024 · Updated last year
annahedstroem / sanity-checks-revisited
[NeurIPS XAIA & Springer] Code and notebooks to paper "A Fresh Look at Sanity Checks for Saliency Maps"
☆25 · Jul 12, 2024 · Updated 2 years ago
anthropics / sycophancy-to-subterfuge-paper
☆28 · Sep 5, 2024 · Updated last year
peterbhase / LAS-NL-Explanations
Code for paper "Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language?"
☆21 · Oct 13, 2020 · Updated 5 years ago
shauli-ravfogel / rlace-icml
☆39 · Jul 14, 2022 · Updated 4 years ago
facebookresearch / decrypto
Implementation of the Decrypto benchmark for multi-agent reasoning and theory of mind.
☆22 · Jan 19, 2026 · Updated 6 months ago
MIT-REALM / certrol
☆11 · Apr 6, 2023 · Updated 3 years ago
EleutherAI / sparsify
Sparsify transformers with SAEs and transcoders
☆734 · Updated this week
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆181 · Sep 18, 2025 · Updated 10 months ago
milesaturpin / cot-unfaithfulness
☆57 · Oct 23, 2023 · Updated 2 years ago