XuandongZhao / weak-to-strong
[ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models
☆76 Updated 2 months ago
Alternatives and similar repositories for weak-to-strong
Users who are interested in weak-to-strong are comparing it to the repositories listed below.
- ☆175 Updated last year
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning ☆94 Updated last year
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"