usail-hkust / Jailjudge
JAILJUDGE: A comprehensive evaluation benchmark that covers a wide range of risk scenarios with complex malicious prompts (synthetic, adversarial, in-the-wild, and multi-language, among others), along with high-quality human-annotated test datasets.
☆48 · Updated 7 months ago
Alternatives and similar repositories for Jailjudge
Users interested in Jailjudge are comparing it to the repositories listed below.
- Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs. Empirical tricks for LLM jailbreaking. (NeurIPS 2024) ☆142 · Updated 7 months ago
- ☆93 · Updated 5 months ago
- 【ACL 2024】 SALAD benchmark & MD-Judge ☆155 · Updated 4 months ago
- ☆92 · Updated 2 months ago
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (EMNLP Findings 2024) ☆80 · Updated 2 months ago
- ☆33 · Updated 9 months ago
- Official repository for the ACL 2024 paper "SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding" ☆137 · Updated last year
- ☆41 · Updated 2 months ago
- [NAACL 2024] Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey ☆105 · Updated 11 months ago
- S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models ☆73 · Updated 2 weeks ago
- Code & data for the paper "Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents" [NeurIPS 2024] ☆81 · Updated 9 months ago
- Official codebase for "STAIR: Improving Safety Alignment with Introspective Reasoning" ☆57 · Updated 4 months ago
- Official repository for "Safety in Large Reasoning Models: A Survey" - Exploring safety risks, attacks, and defenses for Large Reasoning …☆60Updated last month
- Awesome Large Reasoning Model(LRM) Safety.This repository is used to collect security-related research on large reasoning models such as …☆65Updated this week
- LLM Unlearning ☆171 · Updated last year
- ☆175 · Updated last year
- A novel approach to improving the safety of large language models, enabling them to transition effectively from an unsafe to a safe state. ☆63 · Updated last month
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)