annahdo / implementing_activation_steering

A collection of different ways to implement accessing and modifying internal model activations for LLMs

☆19

Alternatives and similar repositories for implementing_activation_steering

Users who are interested in implementing_activation_steering compare it to the libraries listed below.

ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆55 Updated 5 months ago
ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆59 Updated last year
steering-vectors / steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
☆126 Updated 8 months ago
mishajw / repeng
Experiments with representation engineering
☆13 Updated last year
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆98 Updated 2 years ago
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆71 Updated last year
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆28 Updated last year
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆222 Updated 10 months ago
amack315 / unsupervised-steering-vectors
☆36 Updated last year
EleutherAI / concept-erasure
Erasing concepts from neural representations with provable guarantees
☆238 Updated 9 months ago
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆78 Updated 3 months ago
montemac / activation_additions
Algebraic value editing in pretrained language models
☆66 Updated last year
KoyenaPal / future-lens
Code and Data Repo for the CoNLL Paper -- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
☆19 Updated this week
EleutherAI / elk
Keeping language models honest by directly eliciting knowledge encoded in their activations.
☆209 Updated 2 weeks ago
jbloomAus / SAEDashboard
☆74 Updated 2 weeks ago
meg-tong / sycophancy-eval
Datasets from the paper "Towards Understanding Sycophancy in Language Models"
☆94 Updated 2 years ago
neelnanda-io / 1L-Sparse-Autoencoder
☆128 Updated last year
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆219 Updated last week
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆188 Updated last year
EleutherAI / steering-llama3
☆30 Updated last year
koayon / atp_star
PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind)
☆20 Updated 9 months ago
slavachalnev / SAE-TS
Improving Steering Vectors by Targeting Sparse Autoencoder Features
☆25 Updated 11 months ago
ApolloResearch / apd
Attribution-based Parameter Decomposition
☆31 Updated 4 months ago
AsaCooperStickland / situational-awareness-evals
Measuring the situational awareness of language models
☆38 Updated last year
Butanium / tiny-activation-dashboard
A tiny easily hackable implementation of a feature dashboard.
☆15 Updated last month
redwoodresearch / Easy-Transformer
☆126 Updated last year
callummcdougall / sae_visualizer
☆29 Updated last year
nostalgebraist / transformer-utils
Utilities for the HuggingFace transformers library
☆72 Updated 2 years ago
ArthurConmy / Automatic-Circuit-Discovery
☆247 Updated last year
saprmarks / feature-circuits
☆191 Updated last week