aogara-ds / hoodwinked-website

A text-based game where language models learn to lie and to detect lies.

☆12

Alternatives and similar repositories for hoodwinked-website

Users who are interested in hoodwinked-website are comparing it to the libraries listed below.

UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆67Updated 2 months ago
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆27Updated last year
milesaturpin / cot-unfaithfulness
☆44Updated last year
Jiaxin-Wen / MisleadLM
Official Code for our paper: "Language Models Learn to Mislead Humans via RLHF"
☆14Updated 8 months ago
callummcdougall / sae_visualizer
☆28Updated last year
montemac / activation_additions
Algebraic value editing in pretrained language models
☆65Updated last year
google-deepmind / dangerous-capability-evaluations
☆55Updated 9 months ago
Aaquib111 / edge-attribution-patching
Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery"
☆35Updated last year
aypan17 / machiavelli
☆134Updated 7 months ago
tml-epfl / icl-alignment
Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025]
☆30Updated 5 months ago
meg-tong / sycophancy-eval
datasets from the paper "Towards Understanding Sycophancy in Language Models"
☆81Updated last year
saprmarks / geometry-of-truth
☆85Updated 10 months ago
redwoodresearch / Text-Steganography-Benchmark
Code for Preventing Language Models From Hiding Their Reasoning, which evaluates defenses against LLM steganography.
☆22Updated last year
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆70Updated last year
EleutherAI / w2s
☆23Updated 9 months ago
ejnnr / cupbearer
A library for mechanistic anomaly detection
☆22Updated 5 months ago
rishub-tamirisa / tamper-resistance
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
☆59Updated 2 weeks ago
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆94Updated last year
AsaCooperStickland / situational-awareness-evals
Measuring the situational awareness of language models
☆35Updated last year
redwoodresearch / alignment_faking_public
☆66Updated last month
redwoodresearch / interp
Redwood Research's transformer interpretability tools
☆14Updated 3 years ago
IBM / activation-steering
General-purpose activation steering library
☆78Updated last month
koayon / atp_star
PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind)
☆18Updated 5 months ago
lingo-mit / lm-truthfulness
☆17Updated last year
safety-research / open-source-alignment-faking
Open Source Replication of Anthropic's Alignment Faking Paper
☆13Updated 2 months ago
MaheepChaudhary / SAE-Ravel
Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…
☆11Updated 5 months ago
LRudL / evalugator
(Model-written) LLM evals library
☆18Updated 6 months ago
Cadenza-Labs / sleeper-agents
☆11Updated 11 months ago
thestephencasper / everything-you-need
we got you bro
☆35Updated 10 months ago
Nix07 / finetuning
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity…
☆27Updated last year