RobustNLP / CipherChatLinks

A framework to evaluate the generalization capability of safety alignment for LLMs

☆616

Alternatives and similar repositories for CipherChat

Users that are interested in CipherChat are comparing it to the libraries listed below

Sorting:

CHATS-lab / persuasive_jailbreaker
Persuasive Jailbreaker: we can persuade LLMs to jailbreak them!
☆330Updated 2 weeks ago
thu-coai / SafetyBench
Official github repo for SafetyBench, a comprehensive benchmark to evaluate LLMs' safety. [ACL 2024]
☆260Updated 3 months ago
whitzard-ai / jade-db
"他山之石、可以攻玉"：复旦白泽智能发布面向国内开源和国外商用大模型的Demo数据集JADE-DB
☆468Updated last week
tmlr-group / DeepInception
[arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker"
☆164Updated last year
sherdencooper / GPTFuzz
Official repo for GPTFUZZER : Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
☆533Updated last year
SheltonLiu-N / AutoDAN
[ICLR 2024] The official implementation of our ICLR2024 paper "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language M…
☆390Updated 9 months ago
EasyJailbreak / EasyJailbreak
An easy-to-use Python framework to generate adversarial jailbreak prompts.
☆747Updated 7 months ago
thu-coai / ShieldLM
ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors [EMNLP 2024 Findings]
☆214Updated last year
chawins / llm-sp
Papers and resources related to the security and privacy of LLMs 🤖
☆539Updated 4 months ago
LLM-Tuning-Safety / LLMs-Finetuning-Safety
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…
☆328Updated last year
niconi19 / LLM-Conversation-Safety
[NAACL2024] Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
☆106Updated last year
patrickrchao / JailbreakingLLMs
☆639Updated 4 months ago
X-PLUG / CValues
面向中文大模型价值观的评估与对齐研究
☆542Updated 2 years ago
JailbreakBench / jailbreakbench
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track]
☆446Updated 6 months ago
Allen-piexl / JailbreakZoo
☆153Updated last year
NJUNLP / ReNeLLM
The official implementation of our NAACL 2024 paper "A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Lang…
☆141Updated 2 months ago
usail-hkust / JailTrickBench
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs. Empirical tricks for LLM Jailbreaking. (NeurIPS 2024)
☆152Updated 11 months ago
Princeton-SysML / Jailbreak_LLM
☆187Updated last year
IS2Lab / S-Eval
S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models
☆99Updated 2 weeks ago
thu-coai / Safety-Prompts
Chinese safety prompts for evaluating and improving the safety of LLMs. 中文安全prompts，用于评估和提升大模型的安全性。
☆1,089Updated last year
WeOpenML / PandaLM
☆922Updated last year
uw-nsl / ArtPrompt
[ACL24] Official Repo of Paper `ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs`
☆90Updated 2 months ago
Libr-AI / do-not-answer
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
☆295Updated last year
RICommunity / TAP
TAP: An automated jailbreaking method for black-box LLMs
☆194Updated 10 months ago
thunlp / Advbench
Code and data of the EMNLP 2022 paper "Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversaria…
☆61Updated 2 years ago
PKU-YuanGroup / Hallucination-Attack
Attack to induce LLMs within hallucinations
☆161Updated last year
THUDM / MathGLM
Official Pytorch Implementation for MathGLM
☆328Updated last year
OpenSafetyLab / SALAD-BENCH
【ACL 2024】 SALAD benchmark & MD-Judge
☆163Updated 7 months ago
AI-secure / DecodingTrust
A Comprehensive Assessment of Trustworthiness in GPT Models
☆306Updated last year
thu-coai / BPO
☆330Updated last year