redwoodresearch / Text-Steganography-Benchmark

Code for "Preventing Language Models From Hiding Their Reasoning", which evaluates defenses against LLM steganography.

☆24

Alternatives and similar repositories for Text-Steganography-Benchmark

Users who are interested in Text-Steganography-Benchmark are comparing it to the libraries listed below.


GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆235 Updated last year
ucl-dark / llm_debate
Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"
☆117 Updated last year
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆97 Updated 2 years ago
steering-vectors / steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
☆125 Updated 7 months ago
anthropics / sleeper-agents-paper
Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
☆118 Updated last year
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆116 Updated 7 months ago
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆72 Updated last year
centerforaisafety / wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m…
☆143 Updated 4 months ago
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆28 Updated last year
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆136 Updated 3 months ago
meg-tong / sycophancy-eval
Datasets from the paper "Towards Understanding Sycophancy in Language Models"
☆93 Updated last year
IBM / sae-steering
Code to enable layer-level steering in LLMs using sparse auto encoders
☆26 Updated 2 weeks ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆187 Updated last year
ejones313 / auditing-llms
☆58 Updated 2 years ago
milesaturpin / cot-unfaithfulness
☆48 Updated last year
thestephencasper / explore_establish_exploit_llms
☆31 Updated 2 years ago
ethz-spylab / rlhf_trojan_competition
Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.
☆114 Updated last year
JonasGeiping / carving
Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives
☆70 Updated last year
rishub-tamirisa / tamper-resistance
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
☆61 Updated 3 months ago
XuandongZhao / weak-to-strong
[ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models
☆86 Updated 5 months ago
GXimingLu / IPA
Codebase for Inference-Time Policy Adapters
☆24 Updated last year
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆108 Updated 2 weeks ago
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆77 Updated 2 months ago
tonychenxyz / selfie
This repository contains the code and data for the paper "SelfIE: Self-Interpretation of Large Language Model Embeddings" by Haozhe Chen,…
☆51 Updated 9 months ago
redwoodresearch / Easy-Transformer
☆124 Updated last year
ryoungj / ToolEmu
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
☆165 Updated last year
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆80 Updated 7 months ago
LRudL / sad
Situational Awareness Dataset
☆38 Updated 9 months ago
max-andr / adversarial-random-search-gpt4
Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023]
☆43 Updated last year
AsaCooperStickland / situational-awareness-evals
Measuring the situational awareness of language models
☆38 Updated last year