jam3scampbell / llama-lying

Code for our paper "Localizing Lying in Llama"

☆12

Alternatives and similar repositories for llama-lying

Users who are interested in llama-lying are comparing it to the repositories listed below.

rishub-tamirisa / tamper-resistance
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
☆65 Updated 5 months ago
ethz-spylab / rlhf_trojan_competition
Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.
☆115 Updated last year
centerforaisafety / tdc2023-starter-kit
This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition.
☆90 Updated last year
longtermrisk / openweights
A Python SDK for LLM finetuning and inference on RunPod infrastructure
☆16 Updated 3 weeks ago
max-andr / adversarial-random-search-gpt4
Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023]
☆43 Updated last year
ejones313 / auditing-llms
☆59 Updated 2 years ago
JoshEngels / SAE-Dark-Matter
Code for our paper "Decomposing The Dark Matter of Sparse Autoencoders"
☆24 Updated 9 months ago
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆245 Updated last year
aengusl / latent-adversarial-training
☆46 Updated last year
mmazeika / tdc-starter-kit
Starter kit and data loading code for the Trojan Detection Challenge NeurIPS 2022 competition
☆33 Updated 2 years ago
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆100 Updated 2 years ago
thestephencasper / latent_adversarial_training
☆23 Updated last year
princeton-polaris-lab / Evaluating-Durable-Safeguards
[ICLR 2025] On Evaluating the Durability of Safeguards for Open-Weight LLMs
☆13 Updated 5 months ago
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆116 Updated 9 months ago
steering-vectors / steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
☆130 Updated 9 months ago
SchwinnL / circuit-breakers-eval
Independent robustness evaluation of Improving Alignment and Robustness with Short Circuiting
☆18 Updated 7 months ago
centerforaisafety / Intro_to_ML_Safety
☆75 Updated 2 years ago
locuslab / acr-memorization
☆37 Updated 11 months ago
arobey1 / smooth-llm
☆114 Updated 2 years ago
ethz-spylab / autoadvexbench
☆33 Updated 6 months ago
Confirm-Solutions / flrt
Fluent student-teacher redteaming
☆23 Updated last year
JonasGeiping / carving
Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives
☆70 Updated last year
ethz-spylab / rlhf-poisoning
Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"
☆62 Updated last year
UKGovernmentBEIS / control-arena
ControlArena is a collection of settings, model organisms, and protocols for running control experiments.
☆132 Updated this week
iamgroot42 / mimir
Python package for measuring memorization in LLMs.
☆175 Updated 4 months ago
Breakend / SelfDestructingModels
☆12 Updated 2 years ago
thestephencasper / explore_establish_exploit_llms
☆31 Updated 2 years ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆195 Updated last year
centerforaisafety / wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m…
☆156 Updated 6 months ago
milesaturpin / cot-unfaithfulness
☆51 Updated 2 years ago