allenai / safety-eval
A simple evaluation of generative language models and safety classifiers.
☆36, updated 5 months ago
Alternatives and similar repositories for safety-eval:
Users who are interested in safety-eval are comparing it to the libraries listed below:
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs ☆56, updated last month
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆66, updated 10 months ago
- Repo for the research paper "Aligning LLMs to Be Robust Against Prompt Injection" ☆32, updated last month
- Röttger et al. (2023): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆77, updated last year
- ☆34, updated last year
- ☆85, updated last year
- Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions" ☆64, updated 7 months ago
- Align your LM to express calibrated verbal statements of confidence in its long-form generations. ☆20, updated 7 months ago
- ☆51, updated last year
- ConceptVectors Benchmark and Code for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces" ☆32, updated 3 months ago
- WMDP is a LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆92, updated 8 months ago
- Function Vectors in Large Language Models (ICLR 2024) ☆131, updated 3 months ago
- [NeurIPS 2024] Goldfish Loss: Mitigating Memorization in Generative LLMs ☆81, updated 2 months ago
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆26, updated 7 months ago
- Implementation of PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024) ☆31, updated 2 months ago
- Weak-to-Strong Jailbreaking on Large Language Models ☆73, updated 10 months ago
- Monet: Mixture of Monosemantic Experts for Transformers ☆43, updated this week
- Improving Alignment and Robustness with Circuit Breakers ☆174, updated 3 months ago
- This repository includes code for the paper "Does Localization Inform Editing? Surprising Differences in Where Knowledge Is Stored vs. Ca… ☆58, updated last year
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42, updated 11 months ago
- [NeurIPS 2023 D&B Track] Code and data for paper "Revisiting Out-of-distribution Robustness in NLP: Benchmarks, Analysis, and LLMs Evalua… ☆31, updated last year
- ☆50, updated 2 months ago
- The official repository of the paper "On the Exploitability of Instruction Tuning". ☆58, updated 11 months ago
- SILO Language Models code repository ☆81, updated 10 months ago
- ☆26, updated 6 months ago
- Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers". ☆65, updated 10 months ago
- [arXiv preprint] Official Repository for "Evaluating Language Models as Synthetic Data Generators" ☆30, updated last month
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆43, updated 8 months ago
- ☆41, updated this week
- The Official Repository for "Bring Your Own Data! Self-Supervised Evaluation for Large Language Models" ☆108, updated last year