AIRI-Institute / SAE-Reasoning

☆88

Alternatives and similar repositories for SAE-Reasoning

Users who are interested in SAE-Reasoning are comparing it to the libraries listed below.

OpenMOSS / Language-Model-SAEs
For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
☆158 Updated this week
ericwtodd / function_vectors
Function Vectors in Large Language Models (ICLR 2024)
☆181 Updated 6 months ago
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆136 Updated 4 months ago
javiferran / sae_entities
☆63 Updated 7 months ago
ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆59 Updated last year
roeehendel / icl_task_vectors
☆98 Updated 2 years ago
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆83 Updated 7 months ago
ucl-dark / llm_debate
Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"
☆117 Updated last year
MikaStars39 / FeatureAlignment
FeatureAlignment = Alignment + Mechanistic Interpretability
☆31 Updated 7 months ago
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆219 Updated last week
Nix07 / finetuning
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity…
☆28 Updated last year
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆191 Updated last year
jacobdunefsky / transcoder_circuits
☆181 Updated 11 months ago
llm-merging / LLM-Merging
LLM-Merging: Building LLMs Efficiently through Merging
☆204 Updated last year
clarifying-EM / model-organisms-for-EM
Code repo for the model organisms and convergent directions of EM papers.
☆33 Updated last month
activatedgeek / calibration-tuning
☆52 Updated 6 months ago
yuzhaouoe / SAE-based-representation-engineering
[NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
☆66 Updated 11 months ago
deeplearning-wisc / args
☆46 Updated last year
dmis-lab / Monet
[ICLR 2025] Monet: Mixture of Monosemantic Experts for Transformers
☆73 Updated 4 months ago
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆78 Updated 3 months ago
technion-cs-nlp / LLMsKnow
☆78 Updated 9 months ago
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆114 Updated last month
facebookresearch / iGSM
The code for creating the iGSM datasets in papers "Physics of Language Models Part 2.1, Grade-School Math and the Hidden Reasoning Proces…
☆79 Updated 9 months ago
montemac / activation_additions
Algebraic value editing in pretrained language models
☆66 Updated last year
shuyhere / Awesome-Sparse-Autoencoder
Collection of Reverse Engineering in Large Model
☆34 Updated 9 months ago
yihedeng9 / rlhf-summary-notes
A brief and partial summary of RLHF algorithms.
☆133 Updated 7 months ago
Dakingrai / awesome-mechanistic-interpretability-lm-papers
☆206 Updated 11 months ago
bethgelab / sober-reasoning
A Sober Look at Language Model Reasoning
☆86 Updated 3 weeks ago
saprmarks / geometry-of-truth
☆92 Updated last year
MingLiiii / Layer_Gradient
[ACL'25 Oral] What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective
☆75 Updated 4 months ago