ASTRAL-Group / SVIP

SVIP: Towards Verifiable Inference of Open-Source Large Language Models

☆11

Alternatives and similar repositories for SVIP

Users who are interested in SVIP are comparing it to the repositories listed below.

ethz-spylab / rlhf-poisoning
Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"
☆54 Updated last year
Jayfeather1024 / Backdoor-Enhanced-Alignment
☆20 Updated 6 months ago
git-disl / Lisa
This is the official code for the paper "Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning" (NeurIPS 2024)
☆21 Updated 8 months ago
git-disl / Safety-Tax
This is the official code for the paper "Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable".
☆16 Updated 2 months ago
princeton-polaris-lab / Evaluating-Durable-Safeguards
[ICLR 2025] On Evaluating the Durability of Safeguards for Open-Weight LLMs
☆13 Updated 3 months ago
inspire-group / DP-RandP
[NeurIPS 2023] Differentially Private Image Classification by Learning Priors from Random Processes
☆12 Updated last year
vfleaking / PTST
Code for safety test in "Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates"
☆18 Updated last year
shizhouxing / Fast-Certified-Robust-Training
[NeurIPS 2021] Fast Certified Robust Training with Short Warmup
☆24 Updated 2 years ago
AI45Lab / CodeAttack
[ACL 2024] CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
☆42 Updated 7 months ago
fjxmlzn / private-evolution-papers
The collection of papers about Private Evolution
☆16 Updated 3 weeks ago
git-disl / Booster
This is the official code for the paper "Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturba…
☆27 Updated 2 months ago
aounon / certified-llm-safety
☆39 Updated 9 months ago
AISafety-HKUST / Backdoor_Safety_Tuning
Backdoor Safety Tuning (NeurIPS 2023 & 2024 Spotlight)
☆26 Updated 6 months ago
facebookresearch / jailbreak-objectives
Code and data to go with the Zhu et al. paper "An Objective for Nuanced LLM Jailbreaks"
☆31 Updated 5 months ago
reds-lab / Meta-Sift
The official implementation of USENIX Security'23 paper "Meta-Sift" -- Ten minutes or less to find a 1000-size or larger clean subset on …
☆18 Updated 2 years ago
wagner-group / MarkMyWords
☆29 Updated last year
aengusl / latent-adversarial-training
☆39 Updated 8 months ago
umd-huang-lab / VLM-Poisoning
Code for NeurIPS 2024 paper "Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models"
☆49 Updated 4 months ago
lapisrocks / rpo
Official repository for "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks"
☆53 Updated 9 months ago
tmlr-group / G-effect
[ICLR 2025] "Rethinking LLM Unlearning Objectives: A Gradient Perspective and Go Beyond"
☆11 Updated 3 months ago
s-ball-10 / jailbreak_dynamics
☆15 Updated 11 months ago
litian96 / AdaDPS
Private Adaptive Optimization with Side Information (ICML '22)
☆16 Updated 2 years ago
rmin2000 / adv_tracing
Identification of the Adversary from a Single Adversarial Example (ICML 2023)
☆10 Updated 10 months ago
domenicrosati / representation-noising
Code to replicate the Representation Noising paper and tools for evaluating defences against harmful fine-tuning
☆19 Updated 5 months ago
rain152 / PAT
[NeurIPS 2024] Fight Back Against Jailbreaking via Prompt Adversarial Tuning
☆10 Updated 7 months ago
neelsjain / baseline-defenses
Official Code for "Baseline Defenses for Adversarial Attacks Against Aligned Language Models"
☆24 Updated last year
csdongxian / ANP_backdoor
Code for NeurIPS 2021 paper "Adversarial Neuron Pruning Purifies Backdoored Deep Models"
☆57 Updated 2 years ago
OPTML-Group / Unlearn-WorstCase
[ECCV24] "Challenging Forgets: Unveiling the Worst-Case Forget Sets in Machine Unlearning" by Chongyu Fan*, Jiancheng Liu*, Alfred Hero, …
☆21 Updated last week
zhxieml / remiss-jailbreak
☆30 Updated 11 months ago
PKU-ML / PAT
Code for NeurIPS 2024 Paper "Fight Back Against Jailbreaking via Prompt Adversarial Tuning"
☆12 Updated last month