McGill-NLP / AdversarialTriggers

TACL 2025: Investigating Adversarial Trigger Transfer in Large Language Models

☆19

Alternatives and similar repositories for AdversarialTriggers

Users who are interested in AdversarialTriggers are comparing it to the repositories listed below.

locuslab / acr-memorization
☆37 · Updated 11 months ago
declare-lab / resta
Restore safety in fine-tuned language models through task arithmetic
☆29 · Updated last year
azshue / AutoPoison
The official repository of the paper "On the Exploitability of Instruction Tuning".
☆65 · Updated last year
ejones313 / auditing-llms
☆59 · Updated 2 years ago
weichen-yu / LM-Extraction
☆43 · Updated 2 years ago
milesaturpin / cot-unfaithfulness
☆51 · Updated 2 years ago
SafeAILab / RAIN
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
☆99 · Updated last year
unbiarirang / Fixed-Input-Parameterization
This repository contains the official code for the paper: "Prompt Injection: Parameterization of Fixed Inputs"
☆32 · Updated last year
vinid / safety-tuned-llamas
ICLR2024 Paper. Showing properties of safety tuning and exaggerated safety.
☆89 · Updated last year
dannyallover / overthinking_the_truth
☆29 · Updated last year
MaheepChaudhary / SAE-Ravel
Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…
☆12 · Updated 9 months ago
JasonForJoy / Model-Editing-Hurt
EMNLP 2024: Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue
☆37 · Updated 5 months ago
princeton-nlp / benign-data-breaks-safety
☆41 · Updated last year
XuandongZhao / weak-to-strong
[ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models
☆89 · Updated 6 months ago
Vaidehi99 / InfoDeletionAttacks
☆47 · Updated 9 months ago
weizeming / momentum-attack-llm
☆23 · Updated 10 months ago
sail-sg / I-FSJ
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)
☆65 · Updated 10 months ago
pratyushmaini / llm_dataset_inference
Official Repository for Dataset Inference for LLMs
☆43 · Updated last year
poloclub / llm-landscape
NeurIPS'24 - LLM Safety Landscape
☆31 · Updated 3 weeks ago
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆84 · Updated 8 months ago
yihuaihong / ConceptVectors
[EMNLP 2025 Main] ConceptVectors Benchmark and Code for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces"
☆38 · Updated 3 months ago
tml-epfl / long-is-more-for-alignment
Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning [ICML 2024]
☆19 · Updated last year
thestephencasper / explore_establish_exploit_llms
☆31 · Updated 2 years ago
arobey1 / advbench
☆44 · Updated 2 years ago
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆116 · Updated 8 months ago
thestephencasper / latent_adversarial_training
☆23 · Updated last year
tml-epfl / icl-alignment
Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025]
☆31 · Updated 9 months ago
peterljq / Parsimonious-Concept-Engineering
PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024)
☆40 · Updated last year
ethz-spylab / rlhf-poisoning
Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"
☆62 · Updated last year
UCSC-VLAA / AttnGCG-attack
☆19 · Updated 5 months ago