emergent-misalignment/emergent-misalignment

☆298

Alternatives and similar repositories for emergent-misalignment

Users interested in emergent-misalignment compare it to the repositories listed below.

clarifying-EM / model-organisms-for-EM
Code repo for the model organisms and convergent directions of EM papers.
☆66 · Sep 22, 2025 · Updated 8 months ago
longtermrisk / openweights
A Python SDK for LLM finetuning and inference on RunPod infrastructure
☆30 · May 12, 2026 · Updated 2 weeks ago
XuchanBao / behavioral-self-awareness
☆36 · Feb 20, 2025 · Updated last year
TeunvdWeij / sandbagging
☆20 · Nov 15, 2024 · Updated last year
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆30 · May 23, 2024 · Updated 2 years ago
wicai24 / DOOR-Alignment
☆18 · Apr 7, 2025 · Updated last year
safety-research / safety-examples
☆26 · Nov 11, 2025 · Updated 6 months ago
slavachalnev / SAE-TS
Improving Steering Vectors by Targeting Sparse Autoencoder Features
☆27 · Nov 20, 2024 · Updated last year
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆262 · Sep 24, 2024 · Updated last year
safety-research / persona_vectors
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
☆423 · Apr 22, 2026 · Updated last month
YuejiangLIU / csl
Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts
☆16 · Feb 26, 2024 · Updated 2 years ago
AI-secure / MMDT
Comprehensive Assessment of Trustworthiness in Multimodal Foundation Models
☆30 · Mar 15, 2025 · Updated last year
saprmarks / feature-circuits
☆214 · Oct 14, 2025 · Updated 7 months ago
jkutaso / SHADE-Arena
☆46 · May 9, 2025 · Updated last year
LRudL / sad
Situational Awareness Dataset
☆50 · Dec 14, 2024 · Updated last year
redwoodresearch / alignment_faking_public
☆93 · Oct 8, 2025 · Updated 7 months ago
andyzoujm / representation-engineering
Representation Engineering: A Top-Down Approach to AI Transparency
☆994 · Aug 14, 2024 · Updated last year
ndif-team / nnterp
Unified access to Large Language Model modules using NNsight
☆112 · May 6, 2026 · Updated 2 weeks ago
explanare / ravel
Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
☆58 · Oct 30, 2025 · Updated 6 months ago
anthropics / sycophancy-to-subterfuge-paper
☆28 · Sep 5, 2024 · Updated last year
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆297 · Jul 20, 2024 · Updated last year
ApolloResearch / insider-trading
☆59 · Feb 12, 2024 · Updated 2 years ago
curt-tigges / probity
☆19 · Apr 10, 2025 · Updated last year
FlyingPumba / InterpBench
A benchmark for mechanistic discovery of circuits in Transformers
☆17 · Dec 15, 2024 · Updated last year
saprmarks / geometry-of-truth
☆107 · Aug 8, 2024 · Updated last year
ejnnr / cupbearer
A library for mechanistic anomaly detection
☆22 · Jan 9, 2025 · Updated last year
MinhxLe / subliminal-learning
☆137 · Feb 10, 2026 · Updated 3 months ago
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆88 · Mar 7, 2025 · Updated last year
UCSC-REAL / FLAT
[ICLR 2025] FLAT: LLM Unlearning via Loss Adjustment with Only Forget Data
☆14 · Feb 26, 2025 · Updated last year
jacobdunefsky / llm-steering-opt
Tools for optimizing steering vectors in LLMs.
☆22 · Apr 10, 2025 · Updated last year
yc015 / TalkTuner-chatbot-llm-dashboard
Designing a Dashboard for Transparency and Control of Conversational AI, https://arxiv.org/abs/2406.07882
☆39 · Oct 7, 2025 · Updated 7 months ago
amack315 / unsupervised-steering-vectors
☆38 · Apr 30, 2024 · Updated 2 years ago
saprmarks / dictionary_learning
☆416 · Aug 21, 2025 · Updated 9 months ago
PKU-Alignment / aligner
[NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct
☆193 · Jan 16, 2025 · Updated last year
lingo-mit / lm-truthfulness
☆17 · Dec 21, 2023 · Updated 2 years ago
huggingface / feel
☆14 · Apr 8, 2026 · Updated last month
yasamin-med / P2P
☆21 · Mar 18, 2026 · Updated 2 months ago
TransformerLensOrg / TransformerLens
A library for mechanistic interpretability of GPT-style language models
☆3,461 · Updated this week
nrimsky / InfluenceFunctions
Implementation of Influence Function approximations for differently sized ML models, using PyTorch
☆18 · Sep 15, 2023 · Updated 2 years ago