Jayfeather1024/Backdoor-Enhanced-Alignment

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/Jayfeather1024/Backdoor-Enhanced-Alignment)

Jayfeather1024 / Backdoor-Enhanced-Alignment

☆24

Alternatives and similar repositories for Backdoor-Enhanced-Alignment

Users that are interested in Backdoor-Enhanced-Alignment are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

princeton-polaris-lab / Evaluating-Durable-Safeguards
View on GitHub
[ICLR 2025] On Evluating the Durability of Safegurads for Open-Weight LLMs
☆13Jun 20, 2025Updated last year
Lslland / T-Vaccine
View on GitHub
☆19Jun 21, 2025Updated last year
Jayfeather1024 / DensePure
View on GitHub
☆21Oct 5, 2023Updated 2 years ago
IBM / SafeLoRA
View on GitHub
Github repo for NeurIPS 2024 paper "Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models"
☆29Dec 21, 2025Updated 7 months ago
LLLeoLi / LARF
View on GitHub
[EMNLP 2025] Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
☆15Jul 22, 2025Updated 11 months ago
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
aengusl / latent-adversarial-training
View on GitHub
☆48Sep 29, 2024Updated last year
LLM-Tuning-Safety / LLMs-Finetuning-Safety
View on GitHub
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…
☆358Feb 23, 2024Updated 2 years ago
git-disl / Booster
View on GitHub
This is the official code for the paper "Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturba…
☆40Mar 22, 2025Updated last year
amazon-science / SAGE
View on GitHub
☆22Jun 5, 2026Updated last month
baixianghuang / editing-attack
View on GitHub
Code and dataset for the paper: "Can Editing LLMs Inject Harm?" [AAAI'26]
☆21Dec 26, 2025Updated 6 months ago
CryptoAILab / misalignment
View on GitHub
[NDSS'25] The official implementation of safety misalignment.
☆19Jan 8, 2025Updated last year
git-disl / Vaccine
View on GitHub
This is the official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS2024)
☆51Jan 15, 2026Updated 6 months ago
w-yibo / R1-Compress
View on GitHub
[NeurIPS 2025@FoRLM] R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search
☆17Jan 24, 2026Updated 5 months ago
git-disl / awesome_LLM-harmful-fine-tuning-papers
View on GitHub
A survey on harmful fine-tuning attack for large language model (ACM CSUR)
☆247Jun 22, 2026Updated 3 weeks ago
Simple, predictable pricing with DigitalOcean hosting • Ad
Always know what you'll pay with monthly caps and flat pricing. Enterprise-grade infrastructure trusted by 600k+ customers.
homles11 / SaLoRA
View on GitHub
Code for “SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation(ICLR 2025)”
☆29Oct 23, 2025Updated 8 months ago
avalonstrel / Mitigating-the-Alignment-Tax-of-RLHF
View on GitHub
☆16Feb 8, 2024Updated 2 years ago
vinid / safety-tuned-llamas
View on GitHub
ICLR2024 Paper. Showing properties of safety tuning and exaggerated safety.
☆95May 9, 2024Updated 2 years ago
boyiwei / alignment-attribution-code
View on GitHub
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
☆91Mar 30, 2025Updated last year
Unispac / shallow-vs-deep-alignment
View on GitHub
Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep
☆186Apr 23, 2025Updated last year
felixbinder / introspection_self_prediction
View on GitHub
Code for experiments on self-prediction as a way to measure introspection in LLMs
☆16Dec 10, 2024Updated last year
SORRY-Bench / sorry-bench
View on GitHub
Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)
☆83Mar 1, 2025Updated last year
Zhaoxian-Wu / IOS
View on GitHub
Code for paper "Byzantine-Resilient Decentralized Stochastic Optimization with Robust Aggregation Rules"
☆20Apr 19, 2024Updated 2 years ago
uw-nsl / CleanGen
View on GitHub
[EMNLP 24] Official Implementation of CLEANGEN: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models
☆19Mar 9, 2025Updated last year
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
SophieZheng998 / ALI-Agent
View on GitHub
Official implementation for "ALI-Agent: Assessing LLMs'Alignment with Human Values via Agent-based Evaluation"
☆21Jan 31, 2026Updated 5 months ago
YiZeng623 / frequency-backdoor
View on GitHub
ICCV 2021, We find most existing triggers of backdoor attacks in deep learning contain severe artifacts in the frequency domain. This Rep…
☆48Apr 27, 2022Updated 4 years ago
PKU-YuanGroup / AsFT
View on GitHub
Code for the paper "AsFT: Anchoring Safety During LLM Fune-Tuning Within Narrow Safety Basin".
☆37Jul 10, 2025Updated last year
Alan-Qin / Transfer_attack_RAP
View on GitHub
Boosting the Transferability of Adversarial Attacks with Reverse Adversarial Perturbation (NeurIPS 2022)
☆33Dec 16, 2022Updated 3 years ago
wonderNefelibata / Awesome-LRM-Safety
View on GitHub
Awesome Large Reasoning Model(LRM) Safety.This repository is used to collect security-related research on large reasoning models such as …
☆84Updated this week
hanruiqian / Awesome-Federated-LLM-Related-Works
View on GitHub
☆17Oct 12, 2023Updated 2 years ago
zjunlp / LookAheadTuning
View on GitHub
[WSDM 2026] LookAhead Tuning: Safer Language Models via Partial Answer Previews
☆17Dec 14, 2025Updated 7 months ago
poloclub / robust-principles
View on GitHub
Robust Principles: Architectural Design Principles for Adversarially Robust CNNs
☆24Jan 13, 2024Updated 2 years ago
renjie3 / MemAttn
View on GitHub
☆16Feb 23, 2025Updated last year
Virtual machines for every use case on DigitalOcean • Ad
Get dependable uptime with 99.99% SLA, simple security tools, and predictable monthly pricing with DigitalOcean's virtual machines, called Droplets.
HydroXai / Enhancing-Safety-in-Large-Language-Models
View on GitHub
Precision Knowledge Editing (PKE): A novel method to reduce toxicity in LLMs while preserving performance, with robust evaluations and ha…
☆11Nov 26, 2024Updated last year
nikolakon / Robust-Learning-from-Untrusted-Sources
View on GitHub
Code for reproducing the experiments of the ICML 2019 paper "Robust Learning from Untrusted Sources"
☆18Jul 5, 2019Updated 7 years ago
ethz-spylab / jailbreak-tax
View on GitHub
☆24Feb 17, 2026Updated 5 months ago
cnut1648 / Model-Fingerprint
View on GitHub
Fingerprint large language models
☆52Jul 11, 2024Updated 2 years ago
clearloveclearlove / BEAT
View on GitHub
☆15Feb 26, 2025Updated last year
adnansirajrakin / TBT-CVPR2020
View on GitHub
In the repository we provide a sample code to implement the Targeted Bit Trojan attack.
☆20Nov 7, 2020Updated 5 years ago
Suyi32 / Learning-to-Detect-Malicious-Clients-for-Robust-FL
View on GitHub
☆19Dec 7, 2020Updated 5 years ago