vinid / safety-tuned-llamasLinks

ICLR2024 Paper. Showing properties of safety tuning and exaggerated safety.

☆87

Alternatives and similar repositories for safety-tuned-llamas

Users that are interested in safety-tuned-llamas are comparing it to the libraries listed below

Sorting:

paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆116Updated 8 months ago
chujiezheng / LLM-Safeguard
Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models"
☆97Updated 5 months ago
SafeAILab / RAIN
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
☆99Updated last year
boyiwei / alignment-attribution-code
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
☆85Updated 6 months ago
kevinyaobytedance / llm_unlearn
LLM Unlearning
☆176Updated 2 years ago
princeton-nlp / benign-data-breaks-safety
☆41Updated last year
declare-lab / red-instruct
Codes and datasets of the paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
☆105Updated last year
DAMO-NLP-SG / multilingual-safety-for-LLMs
[ICLR 2024]Data for "Multilingual Jailbreak Challenges in Large Language Models"
☆89Updated last year
SORRY-Bench / sorry-bench
Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)
☆62Updated 7 months ago
licong-lin / negative-preference-optimization
☆66Updated last year
Unispac / shallow-vs-deep-alignment
Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep
☆157Updated 6 months ago
uw-nsl / SafeDecoding
Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
☆146Updated last year
git-disl / Vaccine
This is the official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS2024)
☆47Updated 11 months ago
Improbable-AI / curiosity_redteam
Official implementation of ICLR'24 paper, "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…
☆83Updated last year
Princeton-SysML / Jailbreak_LLM
☆185Updated last year
facebookresearch / advprompter
Official implementation of AdvPrompter https//arxiv.org/abs/2404.16873
☆168Updated last year
sail-sg / I-FSJ
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)
☆65Updated 9 months ago
swj0419 / muse_bench
☆28Updated 7 months ago
jinzhuoran / RWKU
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024
☆83Updated last year
zjysteven / mink-plus-plus
[ICLR'25 Spotlight] Min-K%++: Improved baseline for detecting pre-training data of LLMs
☆45Updated 4 months ago
centerforaisafety / tdc2023-starter-kit
This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition.
☆89Updated last year
zhaoyiran924 / Probe-Sampling
[NeurIPS 2024] Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling
☆30Updated 11 months ago
sail-sg / closer-look-LLM-unlearning
[ICLR 2025] A Closer Look at Machine Unlearning for Large Language Models
☆38Updated 10 months ago
rishub-tamirisa / tamper-resistance
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
☆62Updated 4 months ago
YihanWang617 / llm-jailbreaking-defense
A lightweight library for large laguage model (LLM) jailbreaking defense.
☆57Updated last month
OpenSafetyLab / SALAD-BENCH
【ACL 2024】 SALAD benchmark & MD-Judge
☆163Updated 7 months ago
git-disl / Safety-Tax
This is the official code for the paper "Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable".
☆26Updated 7 months ago
yaojin17 / Unlearning_LLM
[ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models"
☆60Updated last year
XuandongZhao / weak-to-strong
[ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models
☆86Updated 5 months ago
OPTML-Group / SOUL
Official repo for EMNLP'24 paper "SOUL: Unlocking the Power of Second-Order Optimization for LLM Unlearning"
☆27Updated last year