MurrayTom / SG-Bench
SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types
☆11 Updated 4 months ago
Alternatives and similar repositories for SG-Bench:
Users who are interested in SG-Bench are comparing it to the repositories listed below:
- S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models ☆55 Updated last month
- RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024 ☆69 Updated 6 months ago
- [ICLR'25] DataGen: Unified Synthetic Dataset Generation via Large Language Models ☆44 Updated 3 weeks ago
- The reinforcement learning codes for dataset SPA-VL ☆31 Updated 9 months ago
- ☆79 Updated last week
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (EMNLP Findings 2024) ☆68 Updated last month
- Official github repo for AutoDetect, an automated weakness detection framework for LLMs. ☆42 Updated 9 months ago
- 【ACL 2024】 SALAD benchmark & MD-Judge ☆134 Updated 3 weeks ago
- ☆32 Updated 5 months ago
- [ICLR 2024] Data for "Multilingual Jailbreak Challenges in Large Language Models" ☆69 Updated last year
- Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models" ☆88 Updated 6 months ago
- [EMNLP 2024] The official GitHub repo for the paper "Course-Correction: Safety Alignment Using Synthetic Preferences" ☆19 Updated 5 months ago
- Code & Data for our Paper "Alleviating Hallucinations of Large Language Models through Induced Hallucinations" ☆63 Updated last year
- ☆49 Updated last month
- ☆43 Updated 9 months ago
- [NAACL2024] Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey ☆91 Updated 7 months ago
- [ICLR 2025] This is the code repo for our ICLR’25 paper "RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rew… ☆31 Updated last month
- ☆19 Updated 5 months ago
- JAILJUDGE: A comprehensive evaluation benchmark which includes a wide range of risk scenarios with complex malicious prompts (e.g., synth… ☆41 Updated 3 months ago
- [ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications ☆73 Updated last month
- [ICLR 25 Oral] RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style ☆28 Updated this week
- [ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models" ☆56 Updated 6 months ago
- Code for Findings-EMNLP 2023 paper: Multi-step Jailbreaking Privacy Attacks on ChatGPT ☆33 Updated last year
- ☆80 Updated last month
- ☆32 Updated 3 months ago
- Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding ☆124 Updated 8 months ago
- LLM evaluation. ☆14 Updated last year
- ☆43 Updated last month
- [arxiv:2412.04905] DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling ☆13 Updated 3 months ago
- [NeurIPS 2024] How do Large Language Models Handle Multilingualism? ☆29 Updated 4 months ago