auditing-llms ☆60 · Updated Mar 9, 2023
Alternatives and similar repositories for auditing-llms
Users interested in auditing-llms are comparing it to the libraries listed below.
- Code for the safety tests in "Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates" ☆22 · Updated Sep 21, 2025
- ☆48 · Updated Feb 8, 2025
- ☆35 · Updated May 21, 2025
- ☆196 · Updated Nov 26, 2023
- ☆48 · Updated Sep 29, 2024
- Starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition ☆90 · Updated May 19, 2024
- Finding trojans in aligned LLMs; official repository for the competition hosted at SaTML 2024 ☆116 · Updated Jun 13, 2024
- Code and data for the Zhu et al. paper "An Objective for Nuanced LLM Jailbreaks" ☆36 · Updated Dec 18, 2024
- Official PyTorch implementation of "Query-Efficient Black-Box Red Teaming via Bayesian Optimization" (ACL'23) ☆15 · Updated Jul 9, 2023
- ☆20 · Updated Feb 11, 2024
- ☆70 · Updated Feb 4, 2024
- ☆25 · Updated May 31, 2024
- ☆44 · Updated Apr 25, 2023
- https://icml.cc/virtual/2023/poster/24354 ☆10 · Updated Aug 15, 2023
- Code for the CSF 2018 paper "Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting" ☆39 · Updated Jan 28, 2019
- ☆43 · Updated May 23, 2023
- ☆698 · Updated Jul 2, 2025
- Improving Alignment and Robustness with Circuit Breakers ☆258 · Updated Sep 24, 2024
- [ICML 2023] Are Diffusion Models Vulnerable to Membership Inference Attacks? ☆43 · Updated Sep 4, 2024
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [ICLR 2025] ☆377 · Updated Jan 23, 2025
- ☆52 · Updated Aug 17, 2024
- ☆13 · Updated Jan 14, 2026
- ☆10 · Updated Feb 3, 2025
- An Empirical Study of Memorization in NLP (ACL 2022) ☆13 · Updated Jun 22, 2022
- Code for the paper "Does Localization Inform Editing? Surprising Differences in Where Knowledge Is Stored vs. Ca… ☆61 · Updated May 9, 2023
- ☆284 · Updated Mar 2, 2024
- Understanding Rare Spurious Correlations in Neural Networks ☆12 · Updated Jun 5, 2022
- This project proposes a method to defend against adversarial attacks. By combining the proposed preprocessing method with an adversariall… ☆10 · Updated Oct 4, 2018
- [EMNLP 2022] Distillation-Resistant Watermarking (DRW) for Model Protection in NLP ☆13 · Updated Aug 17, 2023
- ☆15 · Updated Apr 7, 2023
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal ☆864 · Updated Aug 16, 2024
- A fast + lightweight implementation of the GCG algorithm in PyTorch ☆319 · Updated May 13, 2025
- Source code for the TMLR paper "Black-Box Prompt Learning for Pre-trained Language Models" ☆57 · Updated Sep 7, 2023
- [ICLR 2025] Official repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆67 · Updated Jun 9, 2025
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆128 · Updated Feb 24, 2025
- A re-implementation of the "Red Teaming Language Models with Language Models" paper by Perez et al., 2022 ☆35 · Updated Oct 9, 2023
- Implementation of the BEAST adversarial attack for language models (ICML 2024) ☆90 · Updated May 14, 2024
- Interpreting Learned Search and Planning: reverse-engineering recurrent convolutional networks (DRC) that play Sokoban ☆17 · Updated Jun 29, 2025
- Documenting large text datasets 🖼️ 📚 ☆14 · Updated Dec 17, 2024