paul-rottger / exaggerated-safety

Röttger et al. (2023): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"

☆77

Alternatives and similar repositories for exaggerated-safety:

Users who are interested in exaggerated-safety are comparing it to the repositories listed below.

ejones313 / auditing-llms
☆51 Updated last year
vinid / safety-tuned-llamas
ICLR2024 Paper. Showing properties of safety tuning and exaggerated safety.
☆75 Updated 8 months ago
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆61 Updated 2 months ago
SafeAILab / RAIN
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
☆88 Updated 7 months ago
logix-project / logix
AI Logging for Interpretability and Explainability🔬
☆97 Updated 7 months ago
declare-lab / red-instruct
Codes and datasets of the paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
☆88 Updated 10 months ago
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆85 Updated last year
chujiezheng / LLM-Safeguard
Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models"
☆83 Updated 4 months ago
Vaidehi99 / InfoDeletionAttacks
☆39 Updated last year
centerforaisafety / tdc2023-starter-kit
This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition.
☆82 Updated 7 months ago
boyiwei / alignment-attribution-code
Official Code for Paper: Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
☆65 Updated 3 months ago
OSU-NLP-Group / AmpleGCG
AmpleGCG: Learning a Universal and Transferable Generator of Adversarial Attacks on Both Open and Closed LLM
☆51 Updated 2 months ago
boyiwei / CoTaEval
Official code for the paper: Evaluating Copyright Takedown Methods for Language Models
☆16 Updated 6 months ago
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆174 Updated 3 months ago
locuslab / tofu
Landing Page for TOFU
☆107 Updated 3 weeks ago
centerforaisafety / wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m…
☆92 Updated 8 months ago
rishub-tamirisa / tamper-resistance
Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
☆42 Updated 3 months ago
XuandongZhao / weak-to-strong
Weak-to-Strong Jailbreaking on Large Language Models
☆73 Updated 10 months ago
tatsu-lab / test_set_contamination
☆36 Updated last year
epfl-dlab / llm-latent-language
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
☆64 Updated 10 months ago
ykwon0407 / DataInf
DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024)
☆57 Updated 3 months ago
Unispac / shallow-vs-deep-alignment
Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep
☆63 Updated 6 months ago
YihanWang617 / llm-jailbreaking-defense
A lightweight library for large language model (LLM) jailbreaking defense.
☆44 Updated 3 months ago
Improbable-AI / curiosity_redteam
Official implementation of ICLR'24 paper, "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…
☆67 Updated 10 months ago
licong-lin / negative-preference-optimization
☆44 Updated 6 months ago
eric-mitchell / serac
Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model
☆66 Updated 2 years ago
lapisrocks / rpo
Official repository for "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks"
☆48 Updated 5 months ago
balevinstein / Probes
☆44 Updated last year
kevinyaobytedance / llm_unlearn
LLM Unlearning
☆141 Updated last year
uw-nsl / SafeDecoding
Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
☆112 Updated 5 months ago