mlcommons / modelbench

Run safety benchmarks against AI models and view detailed reports showing how well they performed.

☆99

Alternatives and similar repositories for modelbench

Users who are interested in modelbench are comparing it to the libraries listed below.

GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆225 Updated 10 months ago
allenai / wildguard
Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
☆87 Updated 8 months ago
AI-secure / DecodingTrust
A Comprehensive Assessment of Trustworthiness in GPT Models
☆299 Updated 10 months ago
centerforaisafety / wmdp
WMDP is a LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m…
☆132 Updated 2 months ago
allenai / safety-eval
A simple evaluation of generative language models and safety classifiers.
☆59 Updated last year
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆106 Updated 5 months ago
JonasGeiping / carving
Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives
☆70 Updated last year
LLM-Tuning-Safety / LLMs-Finetuning-Safety
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…
☆314 Updated last year
anthropics / sleeper-agents-paper
Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
☆111 Updated last year
ejones313 / auditing-llms
☆55 Updated 2 years ago
ryoungj / ToolEmu
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
☆152 Updated last year
declare-lab / red-instruct
Codes and datasets of the paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
☆103 Updated last year
meg-tong / sycophancy-eval
datasets from the paper "Towards Understanding Sycophancy in Language Models"
☆86 Updated last year
openai / moderation-api-release
☆136 Updated 3 years ago
UKGovernmentBEIS / inspect_evals
Collection of evals for Inspect AI
☆201 Updated this week
ScalerLab / JudgeBench
☆91 Updated 9 months ago
facebookresearch / SecAlign
Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization"
☆63 Updated 2 weeks ago
Babelscape / ALERT
Official repository for the paper "ALERT: A Comprehensive Benchmark for Assessing Large Language Models’ Safety through Red Teaming"
☆44 Updated 10 months ago
bertiev / SimpleSafetyTests
☆17 Updated last year
haizelabs / redteaming-resistance-benchmark
☆45 Updated last year
felipemaiapolo / tinyBenchmarks
Evaluating LLMs with fewer examples
☆160 Updated last year
allenai / wildteaming
☆32 Updated 11 months ago
Libr-AI / do-not-answer
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
☆270 Updated last year
Princeton-SysML / Jailbreak_LLM
☆178 Updated last year
neelsjain / BYOD
The Official Repository for "Bring Your Own Data! Self-Supervised Evaluation for Large Language Models"
☆107 Updated last year
thestephencasper / explore_establish_exploit_llms
☆31 Updated 2 years ago
ethz-spylab / rlhf_trojan_competition
Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.
☆114 Updated last year
pratyushmaini / llm_dataset_inference
Official Repository for Dataset Inference for LLMs
☆36 Updated last year
kaistAI / FLASK
[ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
☆218 Updated last year
HannahKirk / prism-alignment
The Prism Alignment Project
☆79 Updated last year