centerforaisafety / Intro_to_ML_Safety

☆73

Alternatives and similar repositories for Intro_to_ML_Safety

Users who are interested in Intro_to_ML_Safety are comparing it to the repositories listed below.

ethz-spylab / rlhf_trojan_competition
Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.
☆114 Updated last year
thestephencasper / benchmarking_interpretability
☆34 Updated last year
thestephencasper / latent_adversarial_training
☆22 Updated last year
aengusl / latent-adversarial-training
☆41 Updated 10 months ago
mmazeika / tdc-starter-kit
Starter kit and data loading code for the Trojan Detection Challenge NeurIPS 2022 competition
☆33 Updated 2 years ago
centerforaisafety / tdc2023-starter-kit
This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition.
☆90 Updated last year
rishub-tamirisa / tamper-resistance
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
☆59 Updated 2 months ago
SchwinnL / circuit-breakers-eval
Independent robustness evaluation of Improving Alignment and Robustness with Short Circuiting
☆18 Updated 3 months ago
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆227 Updated 10 months ago
max-andr / adversarial-random-search-gpt4
Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023]
☆43 Updated last year
MadryLab / trak
A fast, effective data attribution method for neural networks in PyTorch
☆215 Updated 8 months ago
anthropics / sleeper-agents-paper
Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
☆113 Updated last year
centerforaisafety / wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m…
☆134 Updated 2 months ago
maxdreyer / PURE
Repository for PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits, accepted at CVPR 2024 XAI4CV Works…
☆18 Updated last year
JonasGeiping / carving
Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives
☆70 Updated last year
UKGovernmentBEIS / control-arena
ControlArena is a collection of settings, model organisms, and protocols for running control experiments.
☆82 Updated this week
Confirm-Solutions / flrt
Fluent student-teacher redteaming
☆22 Updated last year
steering-vectors / steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
☆120 Updated 5 months ago
thestephencasper / explore_establish_exploit_llms
☆31 Updated 2 years ago
ejones313 / auditing-llms
☆56 Updated 2 years ago
iamgroot42 / mimir
Python package for measuring memorization in LLMs.
☆162 Updated 3 weeks ago
chrisliu298 / awesome-representation-engineering
A resource repository for representation engineering in large language models
☆129 Updated 8 months ago
apartresearch / interpretability-starter
🧠 Starter templates for doing interpretability research
☆73 Updated 2 years ago
CLAS2024 / starter-kit
☆39 Updated 9 months ago
slavachalnev / SAE-TS
Improving Steering Vectors by Targeting Sparse Autoencoder Features
☆24 Updated 8 months ago
YanNeu / spurious_imagenet
Spurious Features Everywhere: Large-Scale Detection of Harmful Spurious Features in ImageNet
☆32 Updated last year
adamkarvonen / SAEBench
☆109 Updated 3 weeks ago
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆273 Updated 7 months ago
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆112 Updated last month
facebookresearch / advprompter
Official implementation of AdvPrompter (https://arxiv.org/abs/2404.16873)
☆159 Updated last year