stanford-crfm/air-bench-2024

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/stanford-crfm/air-bench-2024)

stanford-crfm / air-bench-2024

AIR-Bench 2024 is a safety benchmark that aligns with emerging government regulations and company policies

☆30

Alternatives and similar repositories for air-bench-2024

Users that are interested in air-bench-2024 are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

AI-secure / MMDT
View on GitHub
Comprehensive Assessment of Trustworthiness in Multimodal Foundation Models
☆29Mar 15, 2025Updated last year
yjw1029 / Self-Reminder
View on GitHub
Code for our paper "Defending ChatGPT against Jailbreak Attack via Self-Reminder" in NMI.
☆57Nov 13, 2023Updated 2 years ago
apartresearch / DarkBench
View on GitHub
Benchmarking Dark Patterns in LLMs (ICLR 2025)
☆18Mar 29, 2025Updated last year
UCSB-AI / SafeKey
View on GitHub
[EMNLP 2025] Official code for the paper "SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning"
☆16May 12, 2026Updated 2 months ago
lapisrocks / rpo
View on GitHub
Official repository for "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks"
☆62Aug 8, 2024Updated last year
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
myracheng / lm_caricature
View on GitHub
code and data associated with CoMPosT: Characterizing and Evaluating Caricature in LLM Simulations
☆11Oct 13, 2023Updated 2 years ago
paul-rottger / xstest
View on GitHub
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆141Feb 24, 2025Updated last year
kyleeasterly / loose-lips-multipliers
View on GitHub
DEF CON 31 AI Village - LLMs: Loose Lips Multipliers
☆10Aug 16, 2023Updated 2 years ago
olivier-s-j / ADASYN
View on GitHub
Adaptive Synthetic Sampling Approach for Imbalanced Learning
☆13Jun 16, 2013Updated 13 years ago
UCSC-VLAA / STAR-1
View on GitHub
[AAAI'26 Oral] Official Implementation of STAR-1: Safer Alignment of Reasoning LLMs with 1K Data
☆38Apr 7, 2025Updated last year
HKUST-KnowComp / PrivaCI-Bench
View on GitHub
☆23Apr 23, 2025Updated last year
lapisrocks / DiscreteAdversarialDistillation
View on GitHub
[NeurIPS 2023] Official repository for "Distilling Out-of-Distribution Robustness from Vision-Language Foundation Models"
☆11Jun 18, 2024Updated 2 years ago
gangiswag / infogent
View on GitHub
☆24Mar 1, 2025Updated last year
lucweytingh / ARL-UvA
View on GitHub
A reproduced PyTorch implementation of the Adversarially Reweighted Learning (ARL) model, originally presented in "Fairness without Demog…
☆21Jan 30, 2021Updated 5 years ago
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
qingjiesjtu / USC
View on GitHub
This is the code repository of our submission: Understanding the Dark Side of LLMs’ Intrinsic Self-Correction.
☆61Dec 20, 2024Updated last year
whj363636 / Self-Ensemble-Adversarial-Training
View on GitHub
SEAT
☆21Oct 10, 2023Updated 2 years ago
anaistack / cefr-asag-corpus
View on GitHub
A corpus of short answers written by learners of English and graded with CEFR levels
☆12Dec 17, 2021Updated 4 years ago
GeorgeVern / lmcor
View on GitHub
Code for the EACL 2024 paper: "Small Language Models Improve Giants by Rewriting Their Outputs"
☆12Apr 20, 2024Updated 2 years ago
naver-ai / ALMoST
View on GitHub
☆24Dec 2, 2023Updated 2 years ago
LaSTUS-TALN-UPF / TSAR-2022-Shared-Task
View on GitHub
TSAR2022 Shared Task on Lexical Simplification - Datasets and Evaluation scripts
☆10Oct 27, 2022Updated 3 years ago
felixbinder / introspection_self_prediction
View on GitHub
Code for experiments on self-prediction as a way to measure introspection in LLMs
☆16Dec 10, 2024Updated last year
phueb / Zorro
View on GitHub
Grammar test suite for masked language models
☆10Jan 1, 2023Updated 3 years ago
talaugust / definition-complexity
View on GitHub
☆14Jun 13, 2022Updated 4 years ago
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
RU-System-Software-and-Security / NONE
View on GitHub
☆10Oct 31, 2022Updated 3 years ago
sani903 / OpenAgentSafety
View on GitHub
A Framework for Evaluating AI Agent Safety in Realistic Environments
☆38Oct 2, 2025Updated 9 months ago
thu-coai / Agent-SafetyBench
View on GitHub
☆151Aug 11, 2025Updated 11 months ago
IntologyAI / NanoGPT-Bench
View on GitHub
☆21Jul 3, 2026Updated 3 weeks ago
SORRY-Bench / sorry-bench
View on GitHub
Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)
☆83Mar 1, 2025Updated last year
siangooding / readability_scroll
View on GitHub
Dataset containing scroll interactions of 598 partcipants reading advanced and elementary texts from the OneStopEnglish corpus
☆16Dec 27, 2021Updated 4 years ago
beinborn / relative_importance
View on GitHub
☆17Jun 17, 2025Updated last year
MadryLab / pretraining-distribution-shift-robustness
View on GitHub
☆14Mar 4, 2024Updated 2 years ago
AI-secure / DecodingTrust
View on GitHub
A Comprehensive Assessment of Trustworthiness in GPT Models
☆314Sep 16, 2024Updated last year
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
TomSheng21 / AdaptGuard
View on GitHub
ICCV 2023 - AdaptGuard: Defending Against Universal Attacks for Model Adaptation
☆11Dec 23, 2023Updated 2 years ago
Betswish / MIRAGE
View on GitHub
Easy-to-use MIRAGE code for faithful answer attribution in RAG applications. Paper: https://aclanthology.org/2024.emnlp-main.347/
☆25Mar 10, 2025Updated last year
usc-isi-i2 / kgtk-similarity
View on GitHub
☆28Sep 13, 2022Updated 3 years ago
OpenGPTX / lm-evaluation-harness
View on GitHub
A framework for few-shot evaluation of autoregressive language models.
☆13Jul 14, 2025Updated last year
andyrdt / refusal_direction
View on GitHub
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆424Jun 13, 2025Updated last year
tim-learn / Generate_list
View on GitHub
Generate custom text files for dataloader within UDA methods
☆14May 24, 2023Updated 3 years ago
mlcommons / modelbench
View on GitHub
Run safety benchmarks against AI models and view detailed reports showing how well they performed.
☆129Updated this week