mlcommons / modelbench
Run safety benchmarks against AI models and view detailed reports showing how well they performed.
☆114 · Updated last week
Alternatives and similar repositories for modelbench
Users interested in modelbench are comparing it to the libraries listed below.
- Improving Alignment and Robustness with Circuit Breakers ☆252 · Updated last year
- Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs ☆102 · Updated last year
- ☆152 · Updated 3 years ago
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆120 · Updated 10 months ago
- Evaluating LLMs with fewer examples ☆170 · Updated last year
- A simple evaluation of generative language models and safety classifiers. ☆81 · Updated 3 weeks ago
- A Comprehensive Assessment of Trustworthiness in GPT Models ☆311 · Updated last year
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆157 · Updated 7 months ago
- [ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use ☆176 · Updated last year
- Datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆100 · Updated 2 years ago
- Collection of evals for Inspect AI ☆325 · Updated this week
- ☆107 · Updated last year
- [ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets ☆218 · Updated 2 years ago
- Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives ☆70 · Updated last year
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training". ☆123 · Updated last year
- ☆43 · Updated last year
- ☆59 · Updated 2 years ago
- The Prism Alignment Project ☆87 · Updated last year
- The code for the paper ROUTERBENCH: A Benchmark for Multi-LLM Routing System ☆153 · Updated last year
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… ☆337 · Updated last year
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆161 · Updated 6 months ago
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆28 · Updated last year
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" ☆123 · Updated last year
- ☆49 · Updated last year
- This repository provides an original implementation of Detecting Pretraining Data from Large Language Models by *Weijia Shi, *Anirudh Aji… ☆237 · Updated 2 years ago
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs ☆302 · Updated last year
- RuLES: a benchmark for evaluating rule-following in language models ☆245 · Updated 10 months ago
- Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions" ☆71 · Updated last year
- Scalable Meta-Evaluation of LLMs as Evaluators ☆43 · Updated last year
- ☆116 · Updated last year