BeyonderXX/ShadowAlignment

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/BeyonderXX/ShadowAlignment)

BeyonderXX / ShadowAlignment

Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models

☆35

Alternatives and similar repositories for ShadowAlignment

Users that are interested in ShadowAlignment are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

XuandongZhao / weak-to-strong
View on GitHub
[ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models
☆90May 2, 2025Updated last year
WalterSumbon / better-claw
View on GitHub
Turn your claude code into clawdbot
☆22Mar 28, 2026Updated 3 months ago
arpitbansal297 / Certified_Watermarks
View on GitHub
☆16Jul 17, 2022Updated 4 years ago
LLM-Tuning-Safety / LLMs-Finetuning-Safety
View on GitHub
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…
☆358Feb 23, 2024Updated 2 years ago
SewoongLab / spectre-defense
View on GitHub
Defending Against Backdoor Attacks Using Robust Covariance Estimation
☆22Jul 12, 2021Updated 5 years ago
Managed Database hosting by DigitalOcean • Ad
PostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
Wizardcoast / Linear_Alignment
View on GitHub
This repo is reproduction resources for linear alignment paper, still working
☆17May 19, 2024Updated 2 years ago
eth-sri / smoothing-ensembles
View on GitHub
[ICLR 2022] Boosting Randomized Smoothing with Variance Reduced Classifiers
☆11Mar 29, 2022Updated 4 years ago
Dicer-Zz / EPI
View on GitHub
Code for the paper: Rehearsal-free Continual Language Learning via Efficient Parameter Isolation
☆13May 16, 2023Updated 3 years ago
grasses / PoisonPrompt
View on GitHub
Code for paper: PoisonPrompt: Backdoor Attack on Prompt-based Large Language Models, IEEE ICASSP 2024. Demo//124.220.228.133:11107
☆21Aug 10, 2024Updated last year
KYLN24 / CritiQ
View on GitHub
Repository of the paper ''CritiQ: Mining Data Quality Criteria from Human Preferences". Code for CritiQ Flow & Training CritiQ Scorer.
☆22Dec 11, 2025Updated 7 months ago
alevine0 / DPA
View on GitHub
Code for the paper "Deep Partition Aggregation: Provable Defenses against General Poisoning Attacks"
☆14Aug 22, 2022Updated 3 years ago
UmeanNever / NovelSum
View on GitHub
[ACL 2025 Main] Official Repo for Paper "Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric"
☆42Feb 10, 2026Updated 5 months ago
jinyuan-jia / BaggingCertifyDataPoisoning
View on GitHub
☆12Dec 9, 2020Updated 5 years ago
zirui-HIT / NLP_Model
View on GitHub
☆22May 26, 2021Updated 5 years ago
GPUs on demand by Runpod - Special Offer Available • Ad
Run AI, ML, and HPC workloads on powerful cloud GPUs—without limits or wasted spend. Deploy GPUs in under a minute and pay by the second.
limenlp / safer-instruct
View on GitHub
This is the oficial repository for "Safer-Instruct: Aligning Language Models with Automated Preference Data"
☆17Feb 22, 2024Updated 2 years ago
hammlab / PoisoningCertifiedDefenses
View on GitHub
How Robust are Randomized Smoothing based Defenses to Data Poisoning? (CVPR 2021)
☆14Jul 16, 2021Updated 5 years ago
Ping-C / CertifiedObjectDetection
View on GitHub
Certified Object Detection with Randomized Median Smoothing
☆12Oct 21, 2020Updated 5 years ago
6183191 / layered-self-embedding-watermarking
View on GitHub
毕设，分层自嵌入数字水印
☆12Sep 2, 2019Updated 6 years ago
AI-secure / Robustness-Against-Backdoor-Attacks
View on GitHub
RAB: Provable Robustness Against Backdoor Attacks
☆40Oct 3, 2023Updated 2 years ago
clearloveclearlove / BEAT
View on GitHub
☆15Feb 26, 2025Updated last year
thu-coai / SafeUnlearning
View on GitHub
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
☆32Jul 9, 2024Updated 2 years ago
ruizheng20 / gpo
View on GitHub
The code of paper "Toward Optimal LLM Alignments Using Two-Player Games".
☆17Jun 20, 2024Updated 2 years ago
LUMIA-Group / ConceptLM
View on GitHub
Official Implementation of ConceptLM.
☆23Mar 18, 2026Updated 4 months ago
Managed Database hosting by DigitalOcean • Ad
PostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
NayMyatMin / CROW
View on GitHub
Internal Consistency Regularization (CROW) for LLM Backdoor Elimination - Paper accepted to ICML 2025
☆16May 6, 2025Updated last year
wenyudu / MIGU
View on GitHub
[EMNLP 2024 Findings] Unlocking Continual Learning Abilities in Language Models
☆26Oct 8, 2024Updated last year
uw-nsl / SafeDecoding
View on GitHub
Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
☆154Jul 19, 2024Updated 2 years ago
LUMIA-Group / HuRef
View on GitHub
Official implementation for "HuRef: HUman-REadable Fingerprint for Large Language Models" (NeurIPS2024)
☆16Jun 17, 2025Updated last year
Linzwcs / AutoMusicTheoryQA
View on GitHub
☆22Nov 21, 2025Updated 8 months ago
ruizheng20 / robust_ticket
View on GitHub
Code of Robust Lottery Tickets for Pre-trained Language Models (ACL2022)
☆20Jul 18, 2022Updated 4 years ago
yhcc / utcie
View on GitHub
This is the code repo for the paper <UTC-IE: A Unified Token-pair Classification Architecture for Information Extraction>
☆15Aug 10, 2023Updated 2 years ago
DPamK / BadAgent
View on GitHub
☆33Feb 27, 2025Updated last year
EasyJailbreak / EasyJailbreak
View on GitHub
An easy-to-use Python framework to generate adversarial jailbreak prompts.
☆872Mar 30, 2026Updated 3 months ago
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
WowCZ / LongMIT
View on GitHub
LongMIT: Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets
☆43Sep 30, 2024Updated last year
SheltonLiu-N / AutoDAN
View on GitHub
[ICLR 2024] The official implementation of our ICLR2024 paper "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language M…
☆453Jan 22, 2025Updated last year
nrimsky / LM-exp
View on GitHub
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆104Sep 21, 2023Updated 2 years ago
morning-hao / domain-self-instruct
View on GitHub
受到self-instruct启发,除了通用LLM还能做垂直领域的小LLM实现定制效果，通过GPT获得question和answer来作为训练数据
☆18May 12, 2023Updated 3 years ago
xpq-tech / PMET
View on GitHub
This is a repository for "PMET: Precise Model Editing in a Transformer"
☆58Sep 28, 2023Updated 2 years ago
shaoshuo-ss / EaaW
View on GitHub
[NDSS 2025] Official code for our paper "Explanation as a Watermark: Towards Harmless and Multi-bit Model Ownership Verification via Wate…
☆45Nov 5, 2024Updated last year
bot-ssttkkl / nonebot-plugin-mahjong-utils
View on GitHub
Nonebot麻将小工具插件，支持手牌分析、番符点数查询
☆11Apr 14, 2025Updated last year