anthropic-experimental/automated-auditing

Prompts used in the Automated Auditing Blog Post

☆167

Alternatives and similar repositories for automated-auditing

Users who are interested in automated-auditing are comparing it to the repositories listed below.

anthropic-experimental / agentic-misalignment
☆643 · Jun 19, 2025 · Updated last year
safety-research / false-facts
☆50 · Jul 4, 2025 · Updated last year
safety-research / safety-tooling
Inference API for many LLMs and other useful tools for empirical research
☆134 · May 29, 2026 · Updated last month
meridianlabs-ai / inspect_petri
An alignment auditing agent capable of quickly exploring alignment hypothesis
☆1,273 · Updated this week
Harvard-CS-2881 / harvard-cs-2881-hw0
harvard-cs-2881-classroom-hw0-c2881-hw0 created by GitHub Classroom
☆16 · Jul 26, 2025 · Updated 11 months ago
safety-research / finetuning-auditor
Auditing agents for fine-tuning safety
☆21 · Oct 21, 2025 · Updated 9 months ago
safety-research / open-source-alignment-faking
Open Source Replication of Anthropic's Alignment Faking Paper
☆58 · Apr 4, 2025 · Updated last year
SchwinnL / circuit-breakers-eval
Independent robustness evaluation of Improving Alignment and Robustness with Short Circuiting
☆18 · Apr 15, 2025 · Updated last year
safety-research / A3
☆19 · Dec 29, 2025 · Updated 6 months ago
UKGovernmentBEIS / control-arena
ControlArena is a collection of settings, model organisms and protocols - for running control experiments.
☆213 · Updated this week
goodfire-ai / scribe
☆86 · Feb 18, 2026 · Updated 5 months ago
safety-research / safety-examples
☆31 · Nov 11, 2025 · Updated 8 months ago
MinhxLe / subliminal-learning
☆152 · Feb 10, 2026 · Updated 5 months ago
safety-research / SHADE-Arena
☆26 · Jun 22, 2025 · Updated last year
cywinski / eliciting-secret-knowledge
Code repository for "Eliciting Secret Knowledge from Language Models"
☆24 · Mar 30, 2026 · Updated 3 months ago
UKGovernmentBEIS / hibayes
☆53 · May 17, 2026 · Updated 2 months ago
curt-tigges / probity
☆19 · Apr 10, 2025 · Updated last year
recursivelabsai / Prompt-Theory
Recursive Labs introduces Prompt Theory, a framework that establishes structural similarities between artificial intelligence prompting s…
☆22 · Jul 8, 2025 · Updated last year
TransluceAI / introspective-interp
Repository for "Training Language Models To Explain Their Own Computations"
☆23 · Jul 7, 2026 · Updated 2 weeks ago
TransluceAI / docent
☆114 · Updated this week
safety-research / selective-gradient-masking
Training Transformers with knowledge localization (SGTM)
☆54 · Jan 11, 2026 · Updated 6 months ago
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆268 · Updated this week
dnakov / llm-asi-arch
🤖 Complete reproduction of 'AlphaGo Moment for Model Architecture Discovery' using MLX-LM instead of GPT-4. Autonomous neural architectu…
☆30 · Jul 27, 2025 · Updated 11 months ago
safety-research / bloom
bloom - evaluate any behavior immediately 🌸🌱
☆1,371 · May 7, 2026 · Updated 2 months ago
science-of-finetuning / sparsity-artifacts-crosscoders
Code for the "Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning" paper.
☆17 · Jul 6, 2026 · Updated 2 weeks ago
Zyphra / zcookbook
Training hybrid models for dummies.
☆31 · Nov 1, 2025 · Updated 8 months ago
redwoodresearch / Text-Steganography-Benchmark
Code for Preventing Language Models From Hiding Their Reasoning, which evaluates defenses against LLM steganography.
☆25 · Jan 26, 2024 · Updated 2 years ago
UKGovernmentBEIS / aisi-sandboxing
The open-source AISI toolkit for sandboxing agentic evaluations
☆26 · Aug 7, 2025 · Updated 11 months ago
tim-hua-01 / steering-eval-awareness-public
☆17 · Mar 16, 2026 · Updated 4 months ago
curt-tigges / crosslayer-coding
☆18 · Jul 9, 2025 · Updated last year
dreadnode / agent-lens
Agent observability and replay tooling for AI safety & interpretability research.
☆109 · Jun 19, 2026 · Updated last month
safety-research / automated-w2s-research
☆285 · Apr 13, 2026 · Updated 3 months ago
goodfire-ai / sdxl-turbo-interpretability
☆49 · May 27, 2025 · Updated last year
METR / RE-Bench
☆145 · Oct 16, 2025 · Updated 9 months ago
redwoodresearch / mlab
Machine Learning for Alignment Bootcamp
☆84 · Apr 27, 2022 · Updated 4 years ago
hearbenchmark / hear2021-submitted-models
Open-source audio embedding models, submitted to the HEAR 2021 challenge
☆11 · Feb 15, 2026 · Updated 5 months ago
harish-kamath / rqae
Residual Quantization Autoencoder, used for interpreting LLMs
☆14 · Jan 1, 2025 · Updated last year
Trustworthy-ML-Lab / Describe-and-Dissect
[TMLR 25] An automated method for explaining complex neuron behaviors in deep vision models using large language models
☆11 · Feb 20, 2025 · Updated last year
allenai / fluid-benchmarking
Fluid Language Model Benchmarking
☆29 · Sep 16, 2025 · Updated 10 months ago