MikaStars39/FeatureAlignment

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/MikaStars39/FeatureAlignment)

MikaStars39 / FeatureAlignment

FeatureAlignment = Alignment + Mechanistic Interpretability

☆35

Alternatives and similar repositories for FeatureAlignment

Users that are interested in FeatureAlignment are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

tim-lawson / mlsae
View on GitHub
Multi-Layer Sparse Autoencoders (ICLR 2025)
☆30Feb 6, 2026Updated 5 months ago
efarrell1 / train_sparse_autoencoder
View on GitHub
Trains Sparse Autoencoders based on outputs from language models
☆11Oct 7, 2024Updated last year
MikaStars39 / StableMask
View on GitHub
PyTorch implementation of StableMask (ICML'24)
☆15Jun 27, 2024Updated 2 years ago
duykhuongnguyen / MAT-Steer
View on GitHub
☆21Aug 19, 2025Updated 11 months ago
ckkissane / crosscoder-model-diff-replication
View on GitHub
Open source replication of Anthropic's Crosscoders for Model Diffing
☆68Oct 27, 2024Updated last year
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
princeton-nlp / Edge-Pruning
View on GitHub
[NeurIPS 2024 Spotlight] Code and data for the paper "Finding Transformer Circuits with Edge Pruning".
☆70Aug 15, 2025Updated 11 months ago
felixbinder / introspection_self_prediction
View on GitHub
Code for experiments on self-prediction as a way to measure introspection in LLMs
☆16Dec 10, 2024Updated last year
Linzwcs / echos
View on GitHub
Echos is a headless, API-driven DAW engine. It’s the backend for building AI tools that automate the entire music production lifecycle.
☆55Nov 10, 2025Updated 8 months ago
cooperleong00 / Awesome-LLM-Interpretability
View on GitHub
A curated list of LLM Interpretability related material - Tutorial, Library, Survey, Paper, Blog, etc..
☆309Jan 22, 2026Updated 5 months ago
adamkarvonen / dictionary_learning_demo
View on GitHub
☆26Aug 23, 2025Updated 10 months ago
swei2001 / RouteSAEs
View on GitHub
☆15Jan 2, 2026Updated 6 months ago
fiveai / understanding_safety_finetuning
View on GitHub
Official Code for What Makes and Breaks Safety Fine-tuning? A Mechanistic Study (NeurIPS 2024)
☆12Oct 31, 2024Updated last year
DripNowhy / ETA
View on GitHub
[ICLR 2025] PyTorch Implementation of "ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time"
☆34Jul 20, 2025Updated last year
openai / sparse_autoencoder
View on GitHub
☆595Jul 19, 2024Updated 2 years ago
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
ALT-JS / OthelloSAE
View on GitHub
CS194-196 Course Project
☆14Feb 20, 2025Updated last year
aladinD / SafeMERGE
View on GitHub
Code for SafeMERGE (ICLR 2025).
☆15Apr 1, 2025Updated last year
AngelaZZZ-611 / reasoning_models_probing
View on GitHub
☆21May 14, 2026Updated 2 months ago
Aaquib111 / edge-attribution-patching
View on GitHub
Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery"
☆48May 31, 2024Updated 2 years ago
junkangwu / alpha-DPO
View on GitHub
[ICML 2025] Official code of "AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization"
☆31Jan 10, 2026Updated 6 months ago
attentionmech / smolbox
View on GitHub
smolbox of recipies
☆29Apr 23, 2025Updated last year
sjelassi / transformers_ssm_copy
View on GitHub
☆40Feb 26, 2024Updated 2 years ago
StefanHeng / ProgGen
View on GitHub
Code for paper "ProgGen: Generating Named Entity Recognition Datasets Step-by-step with Self-Reflexive Large Language Models"
☆17Mar 29, 2024Updated 2 years ago
Butanium / tiny-activation-dashboard
View on GitHub
A tiny easily hackable implementation of a feature dashboard.
☆17Oct 21, 2025Updated 9 months ago
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
ethz-spylab / unlearning-vs-safety
View on GitHub
☆27Oct 6, 2024Updated last year
zkshan2002 / RTO
View on GitHub
☆22Jun 4, 2025Updated last year
kaistAI / Volcano
View on GitHub
[NAACL 2024] Vision language model that reduces hallucinations through self-feedback guided revision. Visualizes attentions on image feat…
☆49Aug 21, 2024Updated last year
zepingyu0512 / awesome-LLM-neuron
View on GitHub
☆36Jun 13, 2025Updated last year
harish-kamath / rqae
View on GitHub
Residual Quantization Autoencoder, used for interpreting LLMs
☆14Jan 1, 2025Updated last year
BunsenFeng / AbstainQA
View on GitHub
AbstainQA, ACL 2024
☆29Feb 4, 2026Updated 5 months ago
zhu-minjun / SafetyLock
View on GitHub
Your finetuned model's back to its original safety standards faster than you can say "SafetyLock"!
☆11Oct 16, 2024Updated last year
DripNowhy / Octopus
View on GitHub
[ICML 2026] Official implementation for paper: Learning Self-Correction in Vision–Language Models via Rollout Augmentation
☆16Jun 4, 2026Updated last month
YiyangZhou / CSR
View on GitHub
[NeurIPS 2024] Calibrated Self-Rewarding Vision Language Models
☆87Oct 26, 2025Updated 8 months ago
GPU virtual machines on DigitalOcean Gradient AI • Ad
Get to production fast with high-performance AMD and NVIDIA GPUs you can spin up in seconds. The definition of operational simplicity.
Call-for-Code / UnityStarterKit
View on GitHub
This is a sample project for getting started with Unity and data visualization.
☆11Jun 5, 2020Updated 6 years ago
WEIYanbin1999 / GITA
View on GitHub
[NeurIPS 2024] GITA: Graph to Image-Text Integration for Vision-Language Graph Reasoning
☆55Nov 25, 2025Updated 7 months ago
FlyingPumba / InterpBench
View on GitHub
A benchmark for mechanistic discovery of circuits in Transformers
☆17Dec 15, 2024Updated last year
lifan-yuan / FactMix
View on GitHub
Code for COLING 2022 paper "FactMix: Using a Few Labeled In-domain Examples to Generalize to Cross-domain Named Entity Recognition"
☆15Jan 15, 2023Updated 3 years ago
Alrope123 / prompt-waywardness
View on GitHub
☆14Apr 27, 2022Updated 4 years ago
naver-ai / cs-shortcut
View on GitHub
Saving Dense Retriever from Shortcut Dependency in Conversational Search (EMNLP 2022)
☆18Nov 24, 2022Updated 3 years ago
GeorgeLuImmortal / RDL-Rationales-centric-Double-robustness-Learning
View on GitHub
☆12Jun 29, 2024Updated 2 years ago