nrimsky / CAALinks

Steering Llama 2 with Contrastive Activation Addition

☆195

Alternatives and similar repositories for CAA

Users that are interested in CAA are comparing it to the libraries listed below

Sorting:

nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆99Updated 2 years ago
jacobdunefsky / transcoder_circuits
☆189Updated last year
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆123Updated 2 months ago
HoagyC / sparse_coding
Using sparse coding to find distributed representations used by neural networks.
☆286Updated 2 years ago
saprmarks / feature-circuits
☆196Updated last month
OpenMOSS / Language-Model-SAEs
Performant framework for training, analyzing and visualizing Sparse Autoencoders (SAEs) and their frontier variants.
☆164Updated last week
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆228Updated this week
steering-vectors / steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
☆130Updated 9 months ago
montemac / activation_additions
Algebraic value editing in pretrained language models
☆66Updated 2 years ago
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆80Updated 4 months ago
ericwtodd / function_vectors
Function Vectors in Large Language Models (ICLR 2024)
☆186Updated 7 months ago
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆150Updated 5 months ago
adamkarvonen / SAEBench
☆136Updated 2 weeks ago
saprmarks / geometry-of-truth
☆95Updated last year
davidbau / baukit
☆238Updated last year
ArthurConmy / Automatic-Circuit-Discovery
☆255Updated last year
andyrdt / refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆306Updated 5 months ago
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆84Updated 8 months ago
chrisliu298 / awesome-representation-engineering
A resource repository for representation engineering in large language models
☆142Updated last year
hannamw / EAP-IG
☆61Updated 4 months ago
ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆62Updated last year
wesg52 / sparse-probing-paper
Sparse probing paper full code.
☆65Updated last year
yuzhaouoe / SAE-based-representation-engineering
[NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
☆67Updated last year
explanare / ravel
Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
☆57Updated last month
roeehendel / icl_task_vectors
☆101Updated 2 years ago
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆231Updated 11 months ago
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆245Updated last year
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆116Updated 9 months ago
logix-project / logix
AI Logging for Interpretability and Explainability🔬
☆133Updated last year
epfl-dlab / llm-latent-language
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
☆80Updated last year