shadowkiller33 / Language_attack
A repository for LLM jailbreaking.
☆14 · Updated 2 years ago
Alternatives and similar repositories for Language_attack
Users interested in Language_attack are comparing it to the repositories listed below.
- [ICLR 2024] Data for "Multilingual Jailbreak Challenges in Large Language Models" · ☆94 · Updated last year
- ☆18 · Updated 7 months ago
- ☆28 · Updated last year
- [ICLR 2024] Showing properties of safety tuning and exaggerated safety · ☆89 · Updated last year
- Code and datasets of the paper "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment" · ☆107 · Updated last year
- Repo for the paper "Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge" · ☆14 · Updated last year
- ☆22 · Updated last year
- ☆87 · Updated 11 months ago
- Multilingual safety benchmark for Large Language Models · ☆54 · Updated last year
- ☆188 · Updated 2 years ago
- Official repository for the ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models" · ☆100 · Updated 6 months ago
- Code & data for the paper "Alleviating Hallucinations of Large Language Models through Induced Hallucinations" · ☆69 · Updated last year
- Code for the Findings-EMNLP 2023 paper "Multi-step Jailbreaking Privacy Attacks on ChatGPT" · ☆35 · Updated 2 years ago
- ☆57 · Updated 2 years ago
- Augmenting Statistical Models with Natural Language Parameters · ☆29 · Updated last year
- [ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models · ☆89 · Updated 6 months ago
- ☆25 · Updated 5 months ago
- Official code for the EMNLP 2023 Main Conference paper "KCTS: Knowledge-Constrained Tree Search Decoding with Token-Level Hallucination Detec…" · ☆30 · Updated 2 years ago
- [TACL 2025] Investigating Adversarial Trigger Transfer in Large Language Models · ☆19 · Updated 3 months ago
- Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model · ☆69 · Updated 3 years ago
- Mostly recording papers about models' trustworthy applications. Intending to include topics like model evaluation & analysis, security, c… · ☆21 · Updated 2 years ago
- Code for "Goodtriever: Toxicity Mitigation with Retrieval-augmented Language Models" · ☆23 · Updated last year
- [ACL 2023] Knowledge Unlearning for Mitigating Privacy Risks in Language Models · ☆84 · Updated last year
- Dataset and official implementation for "Discursive Socratic Questioning: Evaluating the Faithfulness of Language Models' Understandi…" · ☆18 · Updated last year
- Repository for the Bias Benchmark for QA dataset · ☆132 · Updated last year
- Official implementation of the paper "DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers" · ☆66 · Updated last year
- Restore safety in fine-tuned language models through task arithmetic · ☆29 · Updated last year
- Code for "Mitigating Catastrophic Forgetting in Large Language Models with Self-Synthesized Rehearsal" (ACL 2024) · ☆16 · Updated last year
- [NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering · ☆67 · Updated last year
- The most comprehensive and accurate LLM jailbreak attack benchmark by far · ☆21 · Updated 8 months ago