poloclub/llm-self-defense

LLM Self Defense: By Self Examination, LLMs know they are being tricked

☆52

Alternatives and similar repositories for llm-self-defense

Users that are interested in llm-self-defense are comparing it to the libraries listed below.

thu-coai / JailbreakDefense_GoalPriority
[ACL 2024] Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
☆29 • Jul 9, 2024 • Updated 2 years ago
uw-nsl / SafeDecoding
Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
☆154 • Jul 19, 2024 • Updated 2 years ago
STAIR-BUPT / STAIR-LLMGuardrails
☆12 • Sep 29, 2024 • Updated last year
SafeAILab / RAIN
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
☆99 • May 23, 2024 • Updated 2 years ago
XHMY / AutoDefense
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
☆68 • Jan 15, 2026 • Updated 6 months ago
YihanWang617 / LLM-Jailbreaking-Defense-Backtranslation
Code for paper "Defending against LLM Jailbreaking via Backtranslation"
☆34 • Aug 16, 2024 • Updated last year
yjw1029 / Self-Reminder
Code for our paper "Defending ChatGPT against Jailbreak Attack via Self-Reminder" in NMI.
☆57 • Nov 13, 2023 • Updated 2 years ago
UCSB-NLP-Chang / SemanticSmooth
Implementation of paper 'Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing'
☆24 • Jun 9, 2024 • Updated 2 years ago
usail-hkust / JailTrickBench
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs. Empirical tricks for LLM Jailbreaking. (NeurIPS 2024)
☆167 • Nov 30, 2024 • Updated last year
SheltonLiu-N / AutoDAN
[ICLR 2024] The official implementation of our ICLR2024 paper "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language M…
☆453 • Jan 22, 2025 • Updated last year
tmlr-group / DeepInception
[arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker"
☆176 • Feb 20, 2024 • Updated 2 years ago
SheltonLiu-N / Universal-Prompt-Injection
The official implementation of our pre-print paper "Automatic and Universal Prompt Injection Attacks against Large Language Models".
☆73 • Oct 23, 2024 • Updated last year
Aatrox103 / SAP
☆49 • May 9, 2024 • Updated 2 years ago
aounon / certified-llm-safety
☆53 • Aug 10, 2024 • Updated last year
sail-sg / I-FSJ
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)
☆65 • Jan 11, 2025 • Updated last year
yzmar4real / ai_cybersecurity_compliance
AI-Powered CyberSecurity Compliance: Boost Network Security with OpenAI GPT-3.5-turbo
☆10 • May 18, 2023 • Updated 3 years ago
YihanWang617 / llm-jailbreaking-defense
A lightweight library for large language model (LLM) jailbreaking defense.
☆61 • Sep 11, 2025 • Updated 10 months ago
ZiyueWang25 / llm-security-challenge
Can Large Language Models Solve Security Challenges? We test LLMs' ability to interact and break out of shell environments using the Over…
☆13 • Aug 21, 2023 • Updated 2 years ago
InvokerStark / OverKill
☆15 • Jun 13, 2024 • Updated 2 years ago
ethz-spylab / rlhf-poisoning
Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"
☆67 • Apr 24, 2024 • Updated 2 years ago
akshayballal95 / private_gpt
☆20 • Jun 4, 2023 • Updated 3 years ago
JailbreakBench / jailbreakbench
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track]
☆634 • Apr 4, 2025 • Updated last year
theshi-1128 / llm-defense
An easy-to-use Python framework to defend against jailbreak prompts.
☆21 • Mar 22, 2025 • Updated last year
theshi-1128 / jailbreak-bench
The most comprehensive and accurate LLM jailbreak attack benchmark by far
☆21 • Mar 22, 2025 • Updated last year
AI-secure / CoPur
CoPur: Certifiably Robust Collaborative Inference via Feature Purification (NeurIPS 2022)
☆11 • Dec 7, 2022 • Updated 3 years ago
hyggs / Anomaly-Detection-and-Attack-Identification-in-Network-Traffic-Based-on-Graph
A project from EECS6414M of Winter 2020 at York University
☆11 • Mar 26, 2020 • Updated 6 years ago
zbh2047 / L_inf-dist-net
[ICML 2021] This is the official GitHub repo for training L_inf dist nets with high certified accuracy.
☆41 • Mar 16, 2022 • Updated 4 years ago
brian-lou / Training-Data-Extraction-Attack-on-LLMs
This project explores training data extraction attacks on the LLaMa 7B, GPT-2XL, and GPT-2-IMDB models to discover memorized content usin…
☆15 • Jun 15, 2023 • Updated 3 years ago
WangCheng0116 / Awesome-LRMs-Safety
Official repository for "Safety in Large Reasoning Models: A Survey" - Exploring safety risks, attacks, and defenses for Large Reasoning …
☆90 • Aug 25, 2025 • Updated 10 months ago
berndprach / AOL
Code for paper Almost-Orthogonal Layers for Efficient General-Purpose Lipschitz Networks
☆13 • Aug 9, 2022 • Updated 3 years ago
eurekayuan / RigorLLM
Implementation for "RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content"
☆24 • Jul 28, 2024 • Updated last year
VITA-Group / Robust_Weight_Signatures
[ICML 2023] "Robust Weight Signatures: Gaining Robustness as Easy as Patching Weights?" by Ruisi Cai, Zhenyu Zhang, Zhangyang Wang
☆16 • May 4, 2023 • Updated 3 years ago
zzp1012 / Cross-Task-Linearity
[ICML 2024] Code release for "On the Emergence of Cross-Task Linearity in Pretraining-Finetuning Paradigm"
☆11 • Feb 20, 2025 • Updated last year
lawrennd / neurips2014
Notebooks for managing NeurIPS 2014 and analysing the NeurIPS experiment.
☆13 • May 22, 2024 • Updated 2 years ago
tic-top / LoraCSE
😜Contrastive Learning of Sentence Embedding using LoRA (EECS487 final project)
☆13 • Apr 19, 2023 • Updated 3 years ago
ledllm / ledllm
☆24 • Jun 16, 2024 • Updated 2 years ago
smartswords / privateGPT
☆15 • May 10, 2023 • Updated 3 years ago
joey-wang123 / DRO-Task-free
Code for Improving Task-free Continual Learning by Distributionally Robust Memory Evolution (ICML 2022)
☆11 • Aug 20, 2022 • Updated 3 years ago
weizeming / momentum-attack-llm
☆25 • Jan 17, 2025 • Updated last year