XuandongZhao / weak-to-strong
[ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models
☆86 Updated 5 months ago
Alternatives and similar repositories for weak-to-strong
Users who are interested in weak-to-strong are comparing it to the repositories listed below.
- ☆184 Updated last year
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning ☆98 Updated last year
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆116 Updated 7 months ago
- ICLR2024 Paper. Showing properties of safety tuning and exaggerated safety. ☆87 Updated last year
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆65 Updated 8 months ago
- Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding ☆144 Updated last year
- Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models" ☆97 Updated 4 months ago