Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)
☆76 · Mar 1, 2025 · Updated last year
Alternatives and similar repositories for sorry-bench
Users interested in sorry-bench are comparing it to the repositories listed below.
- [ICLR 2025] On Evaluating the Durability of Safeguards for Open-Weight LLMs ☆13 · Jun 20, 2025 · Updated 8 months ago
- Code to break Llama Guard ☆32 · Dec 7, 2023 · Updated 2 years ago
- The first toolkit for MLRM safety evaluation, providing a unified interface for mainstream models, datasets, and jailbreaking methods! ☆14 · Apr 8, 2025 · Updated 10 months ago
- Code repo of our paper Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis (https://arxiv.org/abs/2406.10794… ☆23 · Jul 26, 2024 · Updated last year
- [ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications ☆89 · Mar 30, 2025 · Updated 11 months ago
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track] ☆535 · Apr 4, 2025 · Updated 10 months ago
- A fast + lightweight implementation of the GCG algorithm in PyTorch ☆318 · May 13, 2025 · Updated 9 months ago
- [ACL 2024] Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization ☆29 · Jul 9, 2024 · Updated last year
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… ☆341 · Feb 23, 2024 · Updated 2 years ago
- Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning [ICML 2024] ☆21 · May 2, 2024 · Updated last year
- This is the official GitHub repo for our paper: "BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Lang… ☆21 · Jul 3, 2024 · Updated last year
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆66 · Jun 9, 2025 · Updated 8 months ago
- Code to replicate the Representation Noising paper and tools for evaluating defences against harmful fine-tuning ☆23 · Dec 12, 2024 · Updated last year
- [NeurIPS 2024 D&B] Evaluating Copyright Takedown Methods for Language Models ☆17 · Jul 17, 2024 · Updated last year
- Source code of "What can linearized neural networks actually say about generalization?" ☆20 · Oct 21, 2021 · Updated 4 years ago
- Official Repository for the paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep ☆174 · Apr 23, 2025 · Updated 10 months ago
- ☆23 · Jun 13, 2024 · Updated last year
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆128 · Feb 24, 2025 · Updated last year
- ☆48 · Sep 29, 2024 · Updated last year
- ☆44 · Oct 1, 2024 · Updated last year
- "Tight Certificates of Adversarial Robustness for Randomly Smoothed Classifiers" (NeurIPS 2019, previously called "A Stratified Approach … ☆17 · Nov 16, 2019 · Updated 6 years ago
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal ☆864 · Aug 16, 2024 · Updated last year
- ☆24 · Dec 8, 2024 · Updated last year
- ☆26 · Mar 4, 2025 · Updated 11 months ago
- Official Code for the ACL 2024 paper "GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis" ☆65 · Oct 27, 2024 · Updated last year
- ☆25 · Sep 3, 2025 · Updated 6 months ago
- The code implementation of MuScleLoRA (accepted at ACL 2024) ☆10 · Dec 1, 2024 · Updated last year
- ACL24 ☆11 · Jun 7, 2024 · Updated last year
- ☆14 · Feb 26, 2025 · Updated last year
- ☆13 · Jun 25, 2025 · Updated 8 months ago
- Code for the paper "Concrete Subspace Learning based Interference Elimination for Multi-task Model Fusion" ☆14 · Mar 28, 2024 · Updated last year
- Improving Alignment and Robustness with Circuit Breakers ☆258 · Sep 24, 2024 · Updated last year
- Repository for the "StrongREJECT for Empty Jailbreaks" paper ☆152 · Nov 3, 2024 · Updated last year
- Code repository for the paper "The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Le… ☆13 · Jan 16, 2025 · Updated last year
- Official implementation of [USENIX Sec'25] StruQ: Defending Against Prompt Injection with Structured Queries ☆63 · Nov 10, 2025 · Updated 3 months ago
- Provable Worst Case Guarantees for the Detection of Out-of-Distribution Data ☆13 · Sep 20, 2022 · Updated 3 years ago
- Official Repository for the ACL 2024 paper "SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding" ☆151 · Jul 19, 2024 · Updated last year
- Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025] ☆32 · Jan 23, 2025 · Updated last year
- Code and datasets of the paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment ☆108 · Mar 8, 2024 · Updated last year