princeton-nlp / benign-data-breaks-safetyLinks

☆43

Alternatives and similar repositories for benign-data-breaks-safety

Users that are interested in benign-data-breaks-safety are comparing it to the libraries listed below

Sorting:

git-disl / Vaccine
This is the official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS2024)
☆48Updated last year
jinzhuoran / RWKU
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024
☆86Updated last year
chujiezheng / LLM-Safeguard
Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models"
☆101Updated 7 months ago
boyiwei / CoTaEval
[NeurIPS 2024 D&B] Evaluating Copyright Takedown Methods for Language Models
☆17Updated last year
boyiwei / alignment-attribution-code
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
☆88Updated 8 months ago
OPTML-Group / SOUL
Official repo for EMNLP'24 paper "SOUL: Unlocking the Power of Second-Order Optimization for LLM Unlearning"
☆28Updated last year
jaechan-repo / muse_bench
☆30Updated last year
licong-lin / negative-preference-optimization
☆69Updated last year
SafeAILab / RAIN
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
☆98Updated last year
vinid / safety-tuned-llamas
ICLR2024 Paper. Showing properties of safety tuning and exaggerated safety.
☆90Updated last year
swj0419 / muse_bench
☆31Updated 9 months ago
sail-sg / closer-look-LLM-unlearning
[ICLR 2025] A Closer Look at Machine Unlearning for Large Language Models
☆41Updated last year
ybwang119 / Awesome-reasoning-safety
This repo is for the safety topic, including attacks, defenses and studies related to reasoning and RL
☆55Updated 3 months ago
OPTML-Group / Unlearn-Simple
[NeurIPS25] Official repo for "Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning"
☆37Updated 2 months ago
dannyallover / overthinking_the_truth
☆29Updated last year
deeplearning-wisc / picle
Official code for ICML 2024 paper on Persona In-Context Learning (PICLe)
☆26Updated last year
David-Li0406 / AI-Supervision-Risk
☆21Updated 9 months ago
yaojin17 / Unlearning_LLM
[ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models"
☆65Updated last year
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆117Updated 10 months ago
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆85Updated 9 months ago
ethz-spylab / unlearning-vs-safety
☆27Updated last year
uw-nsl / safechain
[ACL 25] SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
☆27Updated 8 months ago
yihuaihong / ConceptVectors
[EMNLP 2025 Main] ConceptVectors Benchmark and Code for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces"
☆39Updated 4 months ago
SALT-NLP / Efficient_Unlearning
☆38Updated 2 years ago
Unispac / shallow-vs-deep-alignment
Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep
☆166Updated 8 months ago
VITA-Group / SEAL
[COLM 2025] SEAL: Steerable Reasoning Calibration of Large Language Models for Free
☆47Updated 8 months ago
git-disl / Safety-Tax
This is the official code for the paper "Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable".
☆26Updated 9 months ago
milesaturpin / cot-unfaithfulness
☆51Updated 2 years ago
Vaidehi99 / InfoDeletionAttacks
☆48Updated 10 months ago
Improbable-AI / curiosity_redteam
Official implementation of ICLR'24 paper, "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…
☆84Updated last year