centerforaisafety / Intro_to_ML_Safety
☆77 · May 31, 2023 · Updated 2 years ago
Alternatives and similar repositories for Intro_to_ML_Safety
Users interested in Intro_to_ML_Safety are comparing it to the libraries listed below.
- Machine Learning for Alignment Bootcamp ☆81 · Apr 27, 2022 · Updated 3 years ago
- ☆20 · Feb 17, 2023 · Updated 3 years ago
- A collection of different ways to access and modify internal model activations in LLMs ☆20 · Oct 18, 2024 · Updated last year
- Machine Learning for Alignment Bootcamp ☆27 · Mar 7, 2024 · Updated last year
- ☆16 · Dec 9, 2023 · Updated 2 years ago
- Code for "Automatic Circuit Finding and Faithfulness" ☆16 · Jul 11, 2024 · Updated last year
- Machine Learning for Alignment Bootcamp (MLAB) ☆31 · Jan 24, 2022 · Updated 4 years ago
- Cross-library augmentation toolbox supporting 300 operators over 8 libraries + AI transforms ☆13 · Jan 11, 2022 · Updated 4 years ago
- [NeurIPS 2023] Differentially Private Image Classification by Learning Priors from Random Processes ☆12 · Jun 12, 2023 · Updated 2 years ago
- ☆19 · Jun 10, 2024 · Updated last year
- Work in progress! I don't recommend looking at the code right now. ☆24 · Dec 3, 2025 · Updated 2 months ago
- A School for All Seasons on Trustworthy Machine Learning ☆12 · Jun 30, 2021 · Updated 4 years ago
- Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL. ☆238 · Aug 11, 2025 · Updated 6 months ago
- [ICLR'26 Oral] RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments ☆32 · Feb 9, 2026 · Updated last week
- Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning [ICML 2024] ☆21 · May 2, 2024 · Updated last year
- The Happy Faces Benchmark ☆15 · Jul 20, 2023 · Updated 2 years ago
- Representation Engineering: A Top-Down Approach to AI Transparency ☆946 · Aug 14, 2024 · Updated last year
- Source for llmsec.net ☆16 · Jul 24, 2024 · Updated last year
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆116 · Jun 13, 2024 · Updated last year
- ☆25 · May 31, 2024 · Updated last year
- Inspect: A framework for large language model evaluations ☆1,737 · Updated this week
- ☆30 · Jun 19, 2023 · Updated 2 years ago
- Adversarially Robust Neural Network on MNIST ☆63 · Feb 4, 2022 · Updated 4 years ago
- Trained model weights, training and evaluation code from the paper "A simple way to make neural networks robust against diverse image cor…" ☆62 · May 24, 2023 · Updated 2 years ago
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024] ☆70 · Sep 25, 2024 · Updated last year
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal ☆854 · Aug 16, 2024 · Updated last year
- Keeping language models honest by directly eliciting knowledge encoded in their activations. ☆217 · Feb 9, 2026 · Updated last week
- ☆35 · May 21, 2025 · Updated 8 months ago
- Official implementation for "Sparse Concept Bottleneck Models: Gumbel Tricks in Contrastive Learning" ☆12 · Jun 20, 2025 · Updated 7 months ago
- On the effectiveness of adversarial training against common corruptions [UAI 2022] ☆30 · May 16, 2022 · Updated 3 years ago
- AmpleGCG: Learning a Universal and Transferable Generator of Adversarial Attacks on Both Open and Closed LLMs ☆84 · Nov 3, 2024 · Updated last year
- Code and example data for the paper "Rule Based Rewards for Language Model Safety" ☆206 · Jul 19, 2024 · Updated last year
- [CVPR 2024] Friendly Sharpness-Aware Minimization ☆36 · Oct 29, 2024 · Updated last year
- ☆32 · Jan 13, 2025 · Updated last year
- ☆47 · Jan 14, 2026 · Updated last month
- Lottery Ticket Adaptation ☆40 · Nov 20, 2024 · Updated last year
- Notebooks for reproducing the paper "Computer Vision with a Single (Robust) Classifier" ☆129 · Oct 24, 2019 · Updated 6 years ago
- Training vision models with full-batch gradient descent and regularization ☆39 · Feb 14, 2023 · Updated 3 years ago
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆158 · May 29, 2025 · Updated 8 months ago