andyzoujm/breaking-llama-guard

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/andyzoujm/breaking-llama-guard)

andyzoujm / breaking-llama-guard

Code to break Llama Guard

☆32

Alternatives and similar repositories for breaking-llama-guard

Users that are interested in breaking-llama-guard are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

tml-epfl / long-is-more-for-alignment
View on GitHub
Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning [ICML 2024]
☆21May 2, 2024Updated 2 years ago
princeton-polaris-lab / Evaluating-Durable-Safeguards
View on GitHub
[ICLR 2025] On Evluating the Durability of Safegurads for Open-Weight LLMs
☆13Jun 20, 2025Updated last year
jam3scampbell / llama-lying
View on GitHub
Code for our paper "Localizing Lying in Llama"
☆15Apr 24, 2025Updated last year
SORRY-Bench / sorry-bench
View on GitHub
Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)
☆83Mar 1, 2025Updated last year
eth-lre / LLM_ICL
View on GitHub
ACL24
☆11Jun 7, 2024Updated 2 years ago
Managed Database hosting by DigitalOcean • Ad
PostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
alexandrasouly / strongreject
View on GitHub
Repository for "StrongREJECT for Empty Jailbreaks" paper
☆157Nov 3, 2024Updated last year
ethz-spylab / realistic-adv-examples
View on GitHub
Code for the paper "Evading Black-box Classifiers Without Breaking Eggs" [SaTML 2024]
☆21Apr 15, 2024Updated 2 years ago
princeton-nlp / benign-data-breaks-safety
View on GitHub
☆47Oct 1, 2024Updated last year
adobe-research / beacon-aug
View on GitHub
Cross-library augmentation toolbox supporting 300 operators over 8 libraries + AI transforms
☆12Jan 11, 2022Updated 4 years ago
yuqiChen94 / Swat_Simulator
View on GitHub
☆14Dec 27, 2020Updated 5 years ago
y0mingzhang / diffuse-distributions
View on GitHub
Forcing Diffuse Distributions out of Language Models
☆18Sep 10, 2024Updated last year
paul-rottger / xstest
View on GitHub
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆141Feb 24, 2025Updated last year
vinid / safety-tuned-llamas
View on GitHub
ICLR2024 Paper. Showing properties of safety tuning and exaggerated safety.
☆95May 9, 2024Updated 2 years ago
aengusl / latent-adversarial-training
View on GitHub
☆48Sep 29, 2024Updated last year
Managed Kubernetes at scale on DigitalOcean • Ad
DigitalOcean Kubernetes includes the control plane, bandwidth allowance, container registry, automatic updates, and more for free.
MadryLab / datamodels-data
View on GitHub
Data for "Datamodels: Predicting Predictions with Training Data"
☆97May 25, 2023Updated 3 years ago
pHaeusler / tinycatstories
View on GitHub
☆10Aug 14, 2023Updated 2 years ago
declare-lab / red-instruct
View on GitHub
Codes and datasets of the paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
☆111Mar 8, 2024Updated 2 years ago
pHaeusler / tic_tac_transformer
View on GitHub
☆11Sep 26, 2023Updated 2 years ago
centerforaisafety / tdc2023-starter-kit
View on GitHub
This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition.
☆92May 19, 2024Updated 2 years ago
boyiwei / alignment-attribution-code
View on GitHub
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
☆91Mar 30, 2025Updated last year
jylee425 / mobilesafetybench
View on GitHub
Evaluating Safety of Autonomous Agents in Mobile Device Control (AAAI 2026 AI Alignment Track)
☆34Jan 28, 2026Updated 6 months ago
pratyushmaini / llm_dataset_inference
View on GitHub
Official Repository for Dataset Inference for LLMs
☆41Jul 25, 2024Updated 2 years ago
limenlp / safer-instruct
View on GitHub
This is the oficial repository for "Safer-Instruct: Aligning Language Models with Automated Preference Data"
☆17Feb 22, 2024Updated 2 years ago
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
chawins / pal
View on GitHub
PAL: Proxy-Guided Black-Box Attack on Large Language Models
☆57Aug 17, 2024Updated last year
OSU-NLP-Group / EIA_against_webagent
View on GitHub
☆40Oct 2, 2024Updated last year
LLM-Tuning-Safety / LLMs-Finetuning-Safety
View on GitHub
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…
☆358Feb 23, 2024Updated 2 years ago
ethz-spylab / autoadvexbench
View on GitHub
☆42May 21, 2025Updated last year
max-andr / adversarial-random-search-gpt4
View on GitHub
Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023]
☆43Apr 28, 2024Updated 2 years ago
Unispac / Fight-Poison-With-Poison
View on GitHub
Code repository for the paper --- [USENIX Security 2023] Towards A Proactive ML Approach for Detecting Backdoor Poison Samples
☆31Jul 11, 2023Updated 3 years ago
pdejorge / N-FGSM
View on GitHub
Official repo for the paper "Make Some Noise: Reliable and Efficient Single-Step Adversarial Training" (https://arxiv.org/abs/2202.01181)
☆25Oct 17, 2022Updated 3 years ago
OSU-NLP-Group / SeeActChromeExtension
View on GitHub
☆18Jan 3, 2025Updated last year
ErxinYu / CoSafe-Dataset
View on GitHub
☆13Nov 12, 2024Updated last year
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
thu-coai / JailbreakDefense_GoalPriority
View on GitHub
[ACL 2024] Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
☆29Jul 9, 2024Updated 2 years ago
scaleapi / browser-art
View on GitHub
☆37Mar 6, 2025Updated last year
guanghelee / neurips19-certificates-of-robustness
View on GitHub
"Tight Certificates of Adversarial Robustness for Randomly Smoothed Classifiers" (NeurIPS 2019, previously called "A Stratified Approach …
☆17Nov 16, 2019Updated 6 years ago
gortizji / linearized-networks
View on GitHub
Source code of "What can linearized neural networks actually say about generalization?
☆20Oct 21, 2021Updated 4 years ago
tml-epfl / llm-past-tense
View on GitHub
Does Refusal Training in LLMs Generalize to the Past Tense? [ICLR 2025]
☆78Jan 23, 2025Updated last year
Unispac / shallow-vs-deep-alignment
View on GitHub
Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep
☆190Apr 23, 2025Updated last year
chanchimin / AgentMonitor
View on GitHub
Codes for our paper "AgentMonitor: A Plug-and-Play Framework for Predictive and Secure Multi-Agent Systems"
☆13Dec 13, 2024Updated last year