LRudL / sad

Situational Awareness Dataset

☆40

Alternatives and similar repositories for sad

Users who are interested in sad are comparing it to the libraries listed below.

LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆71 Updated last year
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆28 Updated last year
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆80 Updated 4 months ago
ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆55 Updated 6 months ago
JoshEngels / MultiDimensionalFeatures
Code for reproducing our paper "Not All Language Model Features Are Linear"
☆84 Updated last year
METR / RE-Bench
☆119 Updated last month
jiahai-feng / binding-iclr
☆16 Updated last year
meg-tong / sycophancy-eval
Datasets from the paper "Towards Understanding Sycophancy in Language Models"
☆97 Updated 2 years ago
safety-research / SHADE-Arena
☆20 Updated 5 months ago
ucl-dark / llm_debate
Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"
☆122 Updated last year
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆145 Updated 5 months ago
JacobPfau / fillerTokens
☆75 Updated last year
clarifying-EM / model-organisms-for-EM
Code repo for the model organisms and convergent directions of EM papers.
☆40 Updated 2 months ago
AsaCooperStickland / situational-awareness-evals
Measuring the situational awareness of language models
☆39 Updated last year
steering-vectors / steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
☆130 Updated 9 months ago
safety-research / open-source-alignment-faking
Open Source Replication of Anthropic's Alignment Faking Paper
☆51 Updated 7 months ago
ContextualAI / CLAIR_and_APO
Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
☆60 Updated last year
XuchanBao / behavioral-self-awareness
☆32 Updated 9 months ago
google-deepmind / mishax
☆143 Updated 2 months ago
GXimingLu / IPA
Codebase for Inference-Time Policy Adapters
☆24 Updated 2 years ago
samuelarnesen / nyu-debate-modeling
☆23 Updated last year
ckkissane / sae-transfer
Code to reproduce key results accompanying "SAEs (usually) Transfer Between Base and Chat Models"
☆13 Updated last year
Zhiyuan-Zeng / EvalTree
[COLM 2025] EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees
☆28 Updated 4 months ago
casmlab / NPHardEval
Repository for NPHardEval, a quantified-dynamic benchmark of LLMs
☆61 Updated last year
ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆62 Updated last year
anthropics / toy-models-of-superposition
Notebooks accompanying Anthropic's "Toy Models of Superposition" paper
☆130 Updated 3 years ago
lingo-mit / lm-truthfulness
☆17 Updated last year
noanabeshima / tinymodel
A TinyStories LM with SAEs and transcoders
☆13 Updated 7 months ago
KihoPark / LLM_Categorical_Hierarchical_Representations
☆111 Updated 9 months ago
ConsequentAI / fneval
Functional Benchmarks and the Reasoning Gap
☆90 Updated last year