Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
☆35Oct 19, 2023Updated 2 years ago
Alternatives and similar repositories for ShadowAlignment
Users who are interested in ShadowAlignment are comparing it to the libraries listed below.
- [ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models☆90May 2, 2025Updated last year
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…☆351Feb 23, 2024Updated 2 years ago
- ☆15Feb 26, 2025Updated last year
- Defending Against Backdoor Attacks Using Robust Covariance Estimation☆22Jul 12, 2021Updated 4 years ago
- [ICLR 2022] Boosting Randomized Smoothing with Variance Reduced Classifiers☆11Mar 29, 2022Updated 4 years ago
- Code for paper: PoisonPrompt: Backdoor Attack on Prompt-based Large Language Models, IEEE ICASSP 2024. Demo//124.220.228.133:11107☆21Aug 10, 2024Updated last year
- ☆12Dec 9, 2020Updated 5 years ago
- Implementation of the paper "Exploring the Universal Vulnerability of Prompt-based Learning Paradigm" on Findings of NAACL 2022☆32Jul 11, 2022Updated 3 years ago
- This is the official repository for "Safer-Instruct: Aligning Language Models with Automated Preference Data"☆17Feb 22, 2024Updated 2 years ago
- Certified Object Detection with Randomized Median Smoothing☆12Oct 21, 2020Updated 5 years ago
- Code for the paper "Deep Partition Aggregation: Provable Defenses against General Poisoning Attacks"☆14Aug 22, 2022Updated 3 years ago
- Use the tokenizer in parallel to achieve superior acceleration☆20Mar 21, 2024Updated 2 years ago
- RAB: Provable Robustness Against Backdoor Attacks☆39Oct 3, 2023Updated 2 years ago
- The code of paper "Toward Optimal LLM Alignments Using Two-Player Games".☆17Jun 20, 2024Updated last year
- Graduation project: hierarchical self-embedding digital watermarking☆12Sep 2, 2019Updated 6 years ago
- A graph-based deep learning tool that can recognize kernel objects from raw memory dumps.☆14Jul 6, 2019Updated 6 years ago
- Parallel Breadth-First Search on Hadoop☆17May 20, 2022Updated 4 years ago
- Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding☆152Jul 19, 2024Updated last year
- AutoLR: Layer-wise Pruning and Auto-tuning of Learning Rates in Fine-tuning of Deep Networks☆17Jan 27, 2021Updated 5 years ago
- [EMNLP 2024 Findings] Unlocking Continual Learning Abilities in Language Models☆26Oct 8, 2024Updated last year
- This is the code repo for the paper <UTC-IE: A Unified Token-pair Classification Architecture for Information Extraction>☆15Aug 10, 2023Updated 2 years ago
- Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination.☆22Jul 18, 2025Updated 10 months ago
- Code of Robust Lottery Tickets for Pre-trained Language Models (ACL2022)☆20Jul 18, 2022Updated 3 years ago
- ☆45Jun 25, 2025Updated 11 months ago
- [ACL 2024] ValueBench: Towards Comprehensively Evaluating Value Orientations and Understanding of Large Language Models☆27Jan 11, 2025Updated last year
- ☆23Oct 14, 2024Updated last year
- An easy-to-use Python framework to generate adversarial jailbreak prompts.☆864Mar 30, 2026Updated 2 months ago
- Program Translator AI built on Pytorch☆15Dec 19, 2019Updated 6 years ago
- This is a repository for "PMET: Precise Model Editing in a Transformer"☆57Sep 28, 2023Updated 2 years ago
- LongMIT: Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets☆43Sep 30, 2024Updated last year
- MINER: Mutual Information based Named Entity Recognition☆37May 24, 2022Updated 4 years ago
- [ACL 25] SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities☆30Apr 2, 2025Updated last year
- ☆29Mar 20, 2024Updated 2 years ago
- This is the official implementation of TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data☆13Jul 21, 2024Updated last year
- [NDSS 2025] Official code for our paper "Explanation as a Watermark: Towards Harmless and Multi-bit Model Ownership Verification via Wate…☆46Nov 5, 2024Updated last year
- Dynamic, high-resolution poverty measurement in data-scarce environments☆11Dec 8, 2024Updated last year
- ☆31Feb 27, 2025Updated last year
- Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models"☆108May 20, 2025Updated last year
- Master's project☆19Sep 11, 2019Updated 6 years ago