allenai / safety-eval
A simple evaluation of generative language models and safety classifiers.
☆54 · Updated 10 months ago
Alternatives and similar repositories for safety-eval
Users interested in safety-eval are comparing it to the repositories listed below.
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs ☆77 · Updated 6 months ago
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆97 · Updated 3 months ago
- ☆81 · Updated 6 months ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆89 · Updated last week
- ☆36 · Updated 2 years ago
- ICLR 2024 paper showing properties of safety tuning and exaggerated safety ☆84 · Updated last year
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆56 · Updated 3 months ago
- General-purpose activation steering library ☆75 · Updated 3 weeks ago
- ☆44 · Updated last year
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42 · Updated last year
- Datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆76 · Updated last year
- ☆70 · Updated 4 months ago
- [NeurIPS 2024 D&B] Evaluating Copyright Takedown Methods for Language Models ☆17 · Updated 10 months ago
- Data and code for the preprint "In-Context Learning with Long-Context Models: An In-Depth Exploration" ☆35 · Updated 9 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆208 · Updated 8 months ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity ☆72 · Updated 2 months ago
- Repository for the Bias Benchmark for QA dataset ☆116 · Updated last year
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆61 · Updated 4 months ago
- AI Logging for Interpretability and Explainability 🔬 ☆119 · Updated 11 months ago
- [arXiv preprint] Official Repository for "Evaluating Language Models as Synthetic Data Generators" ☆33 · Updated 5 months ago
- ☆23 · Updated 7 months ago
- [ACL 2024] LangBridge: Multilingual Reasoning Without Multilingual Supervision ☆89 · Updated 7 months ago
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆69 · Updated last year
- PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024) ☆35 · Updated 7 months ago
- [ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models ☆76 · Updated last month
- Repo accompanying the paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers" ☆76 · Updated last year
- Code associated with Tuning Language Models by Proxy (Liu et al., 2024) ☆110 · Updated last year
- [NeurIPS 2024] Train LLMs with diverse system messages reflecting individualized preferences to generalize to unseen system messages ☆47 · Updated 6 months ago
- Code and datasets of the paper "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment" ☆100 · Updated last year