allenai / safety-eval
A simple evaluation of generative language models and safety classifiers.
☆41 · Updated 6 months ago
Alternatives and similar repositories for safety-eval:
Users interested in safety-eval are comparing it to the repositories listed below.
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs ☆61 · Updated 2 months ago
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆84 · Updated last week
- Implementation of PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024) ☆32 · Updated 3 months ago
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆66 · Updated last year
- ☆61 · Updated this week
- [arXiv preprint] Official Repository for "Evaluating Language Models as Synthetic Data Generators" ☆34 · Updated 2 months ago
- ☆57 · Updated 3 months ago
- Code associated with Tuning Language Models by Proxy (Liu et al., 2024) ☆104 · Updated 10 months ago
- The Paper List on Data Contamination for Large Language Models Evaluation ☆91 · Updated last month
- Scalable Meta-Evaluation of LLMs as Evaluators ☆43 · Updated last year
- Run safety benchmarks against AI models and view detailed reports showing how well they performed. ☆79 · Updated this week
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning method… ☆101 · Updated 9 months ago
- This repository contains the official code for the paper: "Prompt Injection: Parameterization of Fixed Inputs" ☆32 · Updated 5 months ago
- ☆31 · Updated last year
- [NeurIPS 2024] Train LLMs with diverse system messages reflecting individualized preferences to generalize to unseen system messages ☆42 · Updated 2 months ago
- 🤫 Code and benchmark for our ICLR 2024 spotlight paper: "Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory" ☆39 · Updated last year
- ☆36 · Updated last year
- Improving Alignment and Robustness with Circuit Breakers ☆185 · Updated 4 months ago
- Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization" ☆37 · Updated last month
- Code and datasets for the paper "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment" ☆90 · Updated 11 months ago
- Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025] ☆29 · Updated 3 weeks ago
- ☆27 · Updated 11 months ago
- Critique-out-Loud Reward Models ☆52 · Updated 4 months ago
- FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions ☆42 · Updated 7 months ago
- ☆44 · Updated 5 months ago
- [ICLR 2025] InstructRAG: Instructing Retrieval-Augmented Generation via Self-Synthesized Rationales ☆72 · Updated 2 weeks ago
- Weak-to-Strong Jailbreaking on Large Language Models ☆72 · Updated last year
- [ICLR 2025] Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates (Oral) ☆70 · Updated 3 months ago
- Restore safety in fine-tuned language models through task arithmetic ☆27 · Updated 10 months ago