Code to break Llama Guard
☆32Dec 7, 2023Updated 2 years ago
Alternatives and similar repositories for breaking-llama-guard
Users who are interested in breaking-llama-guard are comparing it to the libraries listed below
- Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning [ICML 2024]☆21May 2, 2024Updated last year
- Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)☆76Mar 1, 2025Updated last year
- ☆10Oct 31, 2022Updated 3 years ago
- ACL24☆11Jun 7, 2024Updated last year
- [ICLR 2025] On Evaluating the Durability of Safeguards for Open-Weight LLMs☆13Jun 20, 2025Updated 8 months ago
- ☆14Dec 27, 2020Updated 5 years ago
- Code for our paper "Localizing Lying in Llama"☆13Apr 24, 2025Updated 10 months ago
- ☆36May 21, 2025Updated 9 months ago
- [NeurIPS 2023] Differentially Private Image Classification by Learning Priors from Random Processes☆12Jun 12, 2023Updated 2 years ago
- Repository for "StrongREJECT for Empty Jailbreaks" paper☆152Nov 3, 2024Updated last year
- Forcing Diffuse Distributions out of Language Models☆18Sep 10, 2024Updated last year
- Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep☆174Apr 23, 2025Updated 10 months ago
- Code for the paper "Evading Black-box Classifiers Without Breaking Eggs" [SaTML 2024]☆21Apr 15, 2024Updated last year
- ☆44Oct 1, 2024Updated last year
- ☆48Sep 29, 2024Updated last year
- Official Repository for Dataset Inference for LLMs☆42Jul 25, 2024Updated last year
- The library for symbolic interval☆22Jun 23, 2020Updated 5 years ago
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"☆129Feb 24, 2025Updated last year
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023]☆43Apr 28, 2024Updated last year
- Source code of "What can linearized neural networks actually say about generalization?"☆20Oct 21, 2021Updated 4 years ago
- [ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications☆89Mar 30, 2025Updated 11 months ago
- TheNZT is a powerful multi-agent finance query processing system designed to process and respond to finance-related queries efficiently. …☆30Feb 3, 2026Updated last month
- This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition.☆90May 19, 2024Updated last year
- PAL: Proxy-Guided Black-Box Attack on Large Language Models☆57Aug 17, 2024Updated last year
- ☆21Oct 9, 2020Updated 5 years ago
- Data for "Datamodels: Predicting Predictions with Training Data"☆97May 25, 2023Updated 2 years ago
- ICLR2024 Paper. Showing properties of safety tuning and exaggerated safety.☆93May 9, 2024Updated last year
- Official repo for the paper "Make Some Noise: Reliable and Efficient Single-Step Adversarial Training" (https://arxiv.org/abs/2202.01181)☆25Oct 17, 2022Updated 3 years ago
- Code for ICLR 2025 Failures to Find Transferable Image Jailbreaks Between Vision-Language Models☆37Jun 1, 2025Updated 9 months ago
- Language models scale reliably with over-training and on downstream tasks☆100Apr 2, 2024Updated last year
- Tools for the CADCD dataset☆24Aug 30, 2019Updated 6 years ago
- The repository contains code for Adaptive Data Optimization☆32Dec 9, 2024Updated last year
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…☆343Feb 23, 2024Updated 2 years ago
- Universal Neurons in GPT2 Language Models☆30May 28, 2024Updated last year
- Codes and datasets of the paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment☆108Mar 8, 2024Updated 2 years ago
- β-CROWN: Efficient Bound Propagation with Per-neuron Split Constraints for Neural Network Verification☆31Nov 9, 2021Updated 4 years ago
- ☆35Feb 20, 2025Updated last year
- ☆197Nov 26, 2023Updated 2 years ago
- [NeurIPS 2024] Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling☆34Nov 8, 2024Updated last year