MurrayTom / SG-Bench
SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types
☆19Updated 7 months ago
Alternatives and similar repositories for SG-Bench
Users interested in SG-Bench are comparing it to the libraries listed below
- 【ACL 2024】 SALAD benchmark & MD-Judge☆154Updated 4 months ago
- ☆93Updated 5 months ago
- Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models"☆92Updated last month
- [ACL 2024] Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization☆26Updated last year
- ☆29Updated last month
- Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks☆29Updated last year
- Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)☆56Updated 4 months ago
- Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding☆137Updated 11 months ago
- ☆50Updated last year
- ☆52Updated 3 months ago
- Accepted by ECCV 2024☆142Updated 9 months ago
- ☆33Updated 9 months ago
- Awesome Large Reasoning Model (LRM) Safety. This repository is used to collect security-related research on large reasoning models such as …☆65Updated this week
- Official codebase for "STAIR: Improving Safety Alignment with Introspective Reasoning"☆57Updated 4 months ago
- [ICLR 2025] Dissecting adversarial robustness of multimodal language model agents☆97Updated 4 months ago
- To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models☆31Updated last month
- ☆14Updated last month
- The official implementation of our NAACL 2024 paper "A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Lang…☆121Updated 5 months ago
- ☆19Updated 4 months ago
- [NAACL2024] Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey☆104Updated 11 months ago
- [COLM 2024] JailBreakV-28K: A comprehensive benchmark designed to evaluate the transferability of LLM jailbreak attacks to MLLMs, and fur…☆68Updated 2 months ago
- Code and data for paper "Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?". (ACL 2025 Main)☆16Updated last month
- [ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications☆80Updated 3 months ago
- ☆46Updated 5 months ago
- [AAAI'25 (Oral)] Jailbreaking Large Vision-language Models via Typographic Visual Prompts☆156Updated 3 weeks ago
- S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models☆73Updated 2 weeks ago
- RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024☆77Updated 9 months ago
- Code for Findings-EMNLP 2023 paper: Multi-step Jailbreaking Privacy Attacks on ChatGPT☆33Updated last year
- Code for paper "Defending aginast LLM Jailbreaking via Backtranslation"☆29Updated 11 months ago
- [ICLR 2024]Data for "Multilingual Jailbreak Challenges in Large Language Models"☆78Updated last year