Independent robustness evaluation of "Improving Alignment and Robustness with Short Circuiting"
☆18 · Apr 15, 2025 · Updated 10 months ago
Alternatives and similar repositories for circuit-breakers-eval
Users that are interested in circuit-breakers-eval are comparing it to the libraries listed below
- Fluent student-teacher redteaming ☆23 · Jul 25, 2024 · Updated last year
- A method for training neural networks that are provably robust to adversarial attacks. [IJCAI 2019] ☆10 · Sep 3, 2019 · Updated 6 years ago
- Code to replicate the Representation Noising paper and tools for evaluating defences against harmful fine-tuning ☆24 · Dec 12, 2024 · Updated last year
- ☆53 · May 24, 2023 · Updated 2 years ago
- Notes on Direct Preference Optimization ☆24 · Apr 14, 2024 · Updated last year
- Comprehensive Assessment of Trustworthiness in Multimodal Foundation Models ☆27 · Mar 15, 2025 · Updated 11 months ago
- Official repo for the paper "Make Some Noise: Reliable and Efficient Single-Step Adversarial Training" (https://arxiv.org/abs/2202.01181) ☆25 · Oct 17, 2022 · Updated 3 years ago
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆67 · Jun 9, 2025 · Updated 8 months ago
- ☆35 · May 21, 2025 · Updated 9 months ago
- Auditing agents for fine-tuning safety ☆20 · Oct 21, 2025 · Updated 4 months ago
- Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization" ☆87 · Jul 24, 2025 · Updated 7 months ago
- [ICLR 2022 official code] Robust Learning Meets Generative Models: Can Proxy Distributions Improve Adversarial Robustness? ☆29 · Mar 15, 2022 · Updated 3 years ago
- ☆12 · Sep 21, 2023 · Updated 2 years ago
- ☆75 · Feb 18, 2026 · Updated 2 weeks ago
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [ICLR 2025] ☆379 · Jan 23, 2025 · Updated last year
- 🎹🎵🎶 A platform to make Original and Cover Visible and Valuable. ☆13 · Nov 8, 2022 · Updated 3 years ago
- ☆24 · Feb 18, 2026 · Updated 2 weeks ago
- Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs. Empirical tricks for LLM Jailbreaking. (NeurIPS 2024) ☆163 · Nov 30, 2024 · Updated last year
- ☆12 · Oct 29, 2023 · Updated 2 years ago
- Official implementation of the paper "On the Importance of Environments in Human-Robot Coordination", published in RSS 2021. ☆16 · May 1, 2024 · Updated last year
- Code for experiments on self-prediction as a way to measure introspection in LLMs ☆16 · Dec 10, 2024 · Updated last year
- A modern look at the relationship between sharpness and generalization [ICML 2023] ☆43 · Sep 11, 2023 · Updated 2 years ago
- Improving transparency of large language models' reasoning ☆14 · Nov 25, 2025 · Updated 3 months ago
- ☆12 · Mar 4, 2025 · Updated last year
- ☆10 · May 27, 2024 · Updated last year
- Video about NP-completeness, circuit SAT and "reversing time" ☆15 · Aug 18, 2024 · Updated last year
- Scratchpad/Chain-of-Thought Prompts ☆12 · Jun 6, 2022 · Updated 3 years ago
- ACL24 ☆11 · Jun 7, 2024 · Updated last year
- ☆15 · Apr 26, 2025 · Updated 10 months ago
- ☆12 · Jun 9, 2025 · Updated 8 months ago
- ☆19 · Mar 18, 2025 · Updated 11 months ago
- ☆25 · Sep 3, 2025 · Updated 6 months ago
- Proof of concept code for VoteAgain paper ☆10 · Jul 23, 2023 · Updated 2 years ago
- V2 of CodeGraphy. VSCode force-based graph extension for displaying file connections ☆13 · Jun 10, 2023 · Updated 2 years ago
- Code used to produce experimental results for the paper "Deep Structured Prediction with Nonlinear Output Activations" ☆11 · May 6, 2019 · Updated 6 years ago
- Code for EMNLP'24 paper - On Diversified Preferences of Large Language Model Alignment ☆16 · Aug 6, 2024 · Updated last year
- A pipeline for phylogenetic diversity analysis of GBIF-mediated data ☆13 · May 30, 2025 · Updated 9 months ago
- ☆11 · Apr 21, 2023 · Updated 2 years ago
- ☆10 · Oct 31, 2022 · Updated 3 years ago