listen0425/Safety-Layers

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/listen0425/Safety-Layers)

listen0425 / Safety-Layers

code space of paper "Safety Layers in Aligned Large Language Models: The Key to LLM Security" (ICLR 2025)

☆25

Alternatives and similar repositories for Safety-Layers

Users that are interested in Safety-Layers are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

git-disl / Lisa
View on GitHub
This is the official code for the paper "Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning" (NeurIPS2024)
☆29Sep 10, 2024Updated last year
LLLeoLi / LARF
View on GitHub
[EMNLP 2025] Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
☆15Jul 22, 2025Updated last year
DSN-2024 / DSN
View on GitHub
DSN jailbreak Attack & Evaluation Ensemble
☆17Feb 7, 2026Updated 5 months ago
chiayi-hsu / Ring-A-Bell
View on GitHub
☆45Jan 15, 2025Updated last year
hanshen95 / SEAL
View on GitHub
An implementation of SEAL: Safety-Enhanced Aligned LLM fine-tuning via bilevel data selection.
☆24Feb 20, 2025Updated last year
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
vfleaking / PTST
View on GitHub
Code for safety test in "Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates"
☆22Sep 21, 2025Updated 10 months ago
chuhac / Reasoning-to-Defend
View on GitHub
[EMNLP 2025] Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
☆12Aug 22, 2025Updated 11 months ago
git-disl / Virus
View on GitHub
This is the official code for the paper "Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation"
☆56Feb 2, 2025Updated last year
DyMessi / VisCRA
View on GitHub
☆19Dec 23, 2025Updated 7 months ago
THU-KEG / SafetyNeuron
View on GitHub
Data and code for the paper: Finding Safety Neurons in Large Language Models
☆30Jan 29, 2026Updated 5 months ago
ZJU-LLM-Safety / MAJIC-AAAI2026
View on GitHub
[AAAI-2026]MAJIC: Markovian Adaptive Jailbreaking. An automated black-box attack framework against LLMs that iteratively selects and fuse…
☆16Updated this week
XuankunRong / SafeGRPO
View on GitHub
[CVPR'26] SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization
☆21Feb 19, 2026Updated 5 months ago
baixianghuang / editing-attack
View on GitHub
Code and dataset for the paper: "Can Editing LLMs Inject Harm?" [AAAI'26]
☆21Dec 26, 2025Updated 7 months ago
zjunlp / AutoSteer
View on GitHub
[EMNLP 2025] AutoSteer: Automating Steering for Safe Multimodal Large Language Models
☆15Aug 21, 2025Updated 11 months ago
1-Click AI Models by DigitalOcean Gradient • Ad
Deploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click. Zero configuration with optimized deployments.
quicksviewer / quicksviewer
View on GitHub
☆19Jun 29, 2025Updated last year
git-disl / awesome_LLM-harmful-fine-tuning-papers
View on GitHub
A survey on harmful fine-tuning attack for large language model (ACM CSUR)
☆247Jun 22, 2026Updated last month
S2yyyy / OpenClaw-Analysis
View on GitHub
☆31Mar 11, 2026Updated 4 months ago
wbopan / safety-residual-space
View on GitHub
Multi-dimensional analysis of orthogonal safety directions in LLM alignment
☆23Jun 12, 2026Updated last month
grasses / PromptCARE
View on GitHub
Code for paper: "PromptCARE: Prompt Copyright Protection by Watermark Injection and Verification", IEEE S&P 2024.
☆35Aug 10, 2024Updated last year
Jayfeather1024 / Backdoor-Enhanced-Alignment
View on GitHub
☆24Dec 8, 2024Updated last year
thu-ml / STAIR
View on GitHub
Official codebase for "STAIR: Improving Safety Alignment with Introspective Reasoning"
☆89Feb 26, 2025Updated last year
JJ-Vice / BAGM
View on GitHub
All code and data necessary to replicate experiments in the paper BAGM: A Backdoor Attack for Manipulating Text-to-Image Generative Model…
☆13Sep 16, 2024Updated last year
CHATS-lab / LLMs_Encode_Harmfulness_Refusal_Separately
View on GitHub
☆41Jul 3, 2026Updated 3 weeks ago
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
Jinxiaolong1129 / Foot-in-the-door-Jailbreak
View on GitHub
☆23May 14, 2025Updated last year
declare-lab / resta
View on GitHub
Restore safety in fine-tuned language models through task arithmetic
☆33Mar 28, 2024Updated 2 years ago
thu-coai / TransferAttack
View on GitHub
[ACL 2025] Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints
☆19May 23, 2025Updated last year
ning-wang1 / manda
View on GitHub
code for infocom 2021 paper MANDA
☆11May 30, 2023Updated 3 years ago
criticalml-uw / TamperBench
View on GitHub
Toolkit to benchmark the tamper-resistance of LLMs.
☆28May 15, 2026Updated 2 months ago
mohsenhariri / spectral-kv
View on GitHub
Quantize What Counts: More for Keys, Less for Values (ACL 2026)
☆22Nov 7, 2025Updated 8 months ago
uw-nsl / ArtPrompt
View on GitHub
[ACL24] Official Repo of Paper `ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs`
☆102Aug 15, 2025Updated 11 months ago
XuandongZhao / weak-to-strong
View on GitHub
[ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models
☆90May 2, 2025Updated last year
homles11 / SaLoRA
View on GitHub
Code for “SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation(ICLR 2025)”
☆29Oct 23, 2025Updated 9 months ago
1-Click AI Models by DigitalOcean Gradient • Ad
Deploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click. Zero configuration with optimized deployments.
likenneth / persona_drift
View on GitHub
Measuring and Controlling Persona Drift in Language Model Dialogs
☆26Feb 26, 2024Updated 2 years ago
dt-3t / Transformer-en-to-cn
View on GitHub
使用Transformer进行中英翻译（demo）
☆17Aug 25, 2023Updated 2 years ago
cadentj / caft
View on GitHub
☆25Mar 30, 2026Updated 3 months ago
Dtc7w3PQ / Visco-Attack
View on GitHub
Official implementation of Visco-Attack (EMNLP 2025 Main). An open-source one-click reproduction script is also provided.
☆31Apr 11, 2026Updated 3 months ago
Jometeorie / probing_llama
View on GitHub
☆17Feb 26, 2024Updated 2 years ago
botextractai / ai-langchain-react-agent
View on GitHub
Create a LangChain ReAct agent with multiple tools (Python REPL and DuckDuckGo Search)
☆14Updated this week
princeton-nlp / benign-data-breaks-safety
View on GitHub
☆47Oct 1, 2024Updated last year