EleutherAI / elk

Keeping language models honest by directly eliciting knowledge encoded in their activations.

☆207

Alternatives and similar repositories for elk

Users who are interested in elk are comparing it to the libraries listed below.

EleutherAI / concept-erasure
Erasing concepts from neural representations with provable guarantees
☆231 Updated 6 months ago
anthropics / evals
☆285 Updated last year
collin-burns / discovering_latent_knowledge
☆274 Updated last year
likenneth / othello_world
Emergent world representations: Exploring a sequence model trained on a synthetic task
☆184 Updated 2 years ago
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆272 Updated 7 months ago
AlignmentResearch / tuned-lens
Tools for understanding how transformer predictions are built layer-by-layer
☆512 Updated last year
ArthurConmy / Automatic-Circuit-Discovery
☆233 Updated 10 months ago
neelnanda-io / 1L-Sparse-Autoencoder
☆123 Updated last year
nostalgebraist / transformer-utils
Utilities for the HuggingFace transformers library
☆70 Updated 2 years ago
anthropics / toy-models-of-superposition
Notebooks accompanying Anthropic's "Toy Models of Superposition" paper
☆127 Updated 2 years ago
callummcdougall / sae_visualizer
☆28 Updated last year
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆207 Updated 7 months ago
justinchiu / openlogprobs
Extract full next-token probabilities via language model APIs
☆247 Updated last year
meg-tong / sycophancy-eval
Datasets from the paper "Towards Understanding Sycophancy in Language Models"
☆85 Updated last year
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆200 Updated this week
moirage / alignment-research-dataset
A dataset of alignment research and code to reproduce it
☆77 Updated 2 years ago
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆28 Updated last year
annahdo / implementing_activation_steering
A collection of different ways to implement accessing and modifying internal model activations for LLMs
☆19 Updated 9 months ago
TomFrederik / unseal
Mechanistic Interpretability for Transformer Models
☆51 Updated 3 years ago
redwoodresearch / Easy-Transformer
☆121 Updated 11 months ago
METR / task-standard
METR Task Standard
☆154 Updated 5 months ago
AsaCooperStickland / situational-awareness-evals
Measuring the situational awareness of language models
☆37 Updated last year
TransluceAI / observatory
A toolkit for describing model features and intervening on those features to steer behavior.
☆195 Updated 8 months ago
aypan17 / machiavelli
☆137 Updated last week
google-deepmind / mishax
☆134 Updated 4 months ago
METR / RE-Bench
☆94 Updated 3 months ago
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆71 Updated last year
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆167 Updated last year
steering-vectors / steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
☆119 Updated 5 months ago
ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆57 Updated 9 months ago