ThuCCSLab / misalignment
[NDSS'25] The official implementation of safety misalignment.
☆15 · Updated 6 months ago
Alternatives and similar repositories for misalignment
Users interested in misalignment are comparing it to the repositories listed below.
- ☆9 · Updated 3 weeks ago
- ☆30 · Updated 9 months ago
- ☆15 · Updated 5 months ago
- To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models ☆31 · Updated last month
- [ICLR24] Official Repo of BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models ☆36 · Updated 11 months ago
- Comprehensive Assessment of Trustworthiness in Multimodal Foundation Models ☆21 · Updated 4 months ago
- ☆30 · Updated 2 months ago
- ☆11 · Updated this week
- ☆13 · Updated last year
- [NDSS 2025] Official code for our paper "Explanation as a Watermark: Towards Harmless and Multi-bit Model Ownership Verification via Wate… ☆39 · Updated 8 months ago
- ☆31 · Updated 3 months ago
- Official Code for ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users (NeurIPS 2024) ☆16 · Updated 8 months ago
- ☆18 · Updated 10 months ago
- ☆20 · Updated last year
- ☆56 · Updated last month
- ☆34 · Updated 3 months ago
- This is the code repository of our submission: Understanding the Dark Side of LLMs’ Intrinsic Self-Correction. ☆56 · Updated 7 months ago
- ☆47 · Updated last year
- A curated list of trustworthy Generative AI papers, updated daily. ☆73 · Updated 10 months ago
- [ICLR 2024] Towards Eliminating Hard Label Constraints in Gradient Inversion Attacks ☆13 · Updated last year
- [ECCV'24 Oral] The official GitHub page for "Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking … ☆25 · Updated 8 months ago
- [EMNLP 24] Official Implementation of CLEANGEN: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models ☆16 · Updated 4 months ago
- ☆20 · Updated last year
- [NeurIPS 2024] "Membership Inference on Text-to-image Diffusion Models via Conditional Likelihood Discrepancy" ☆11 · Updated 2 weeks ago
- ☆18 · Updated 2 years ago
- Code & Data for the paper "Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents" [NeurIPS 2024] ☆84 · Updated 9 months ago
- BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks and Defenses on Large Language Models ☆177 · Updated 3 weeks ago
- ☆30 · Updated last year
- [ICLR 2024] Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images ☆36 · Updated last year
- Distribution Preserving Backdoor Attack in Self-supervised Learning ☆16 · Updated last year