princeton-polaris-lab/Evaluating-Durable-Safeguards

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/princeton-polaris-lab/Evaluating-Durable-Safeguards)

princeton-polaris-lab / Evaluating-Durable-Safeguards

[ICLR 2025] On Evluating the Durability of Safegurads for Open-Weight LLMs

☆13

Alternatives and similar repositories for Evaluating-Durable-Safeguards

Users that are interested in Evaluating-Durable-Safeguards are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

boyiwei / CoTaEval
View on GitHub
[NeurIPS 2024 D&B] Evaluating Copyright Takedown Methods for Language Models
☆17Jul 17, 2024Updated 2 years ago
Jayfeather1024 / Backdoor-Enhanced-Alignment
View on GitHub
☆24Dec 8, 2024Updated last year
rishub-tamirisa / tamper-resistance
View on GitHub
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
☆68Jun 9, 2025Updated last year
princeton-nlp / benign-data-breaks-safety
View on GitHub
☆47Oct 1, 2024Updated last year
SORRY-Bench / sorry-bench
View on GitHub
Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)
☆83Mar 1, 2025Updated last year
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
boyiwei / alignment-attribution-code
View on GitHub
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
☆91Mar 30, 2025Updated last year
domenicrosati / representation-noising
View on GitHub
Code to replicate the Representation Noising paper and tools for evaluating defences against harmful fine-tuning
☆24Dec 12, 2024Updated last year
IBM / NeuralFuse
View on GitHub
[NeurIPS'24] "NeuralFuse: Learning to Recover the Accuracy of Access-Limited Neural Network Inference in Low-Voltage Regimes" by Hao-Lun …
☆10Sep 18, 2025Updated 10 months ago
RU-System-Software-and-Security / NONE
View on GitHub
☆10Oct 31, 2022Updated 3 years ago
git-disl / Vaccine
View on GitHub
This is the official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS2024)
☆51Jan 15, 2026Updated 6 months ago
aladinD / SafeMERGE
View on GitHub
Code for SafeMERGE (ICLR 2025).
☆15Apr 1, 2025Updated last year
jam3scampbell / llama-lying
View on GitHub
Code for our paper "Localizing Lying in Llama"
☆15Apr 24, 2025Updated last year
hammlab / PoisoningCertifiedDefenses
View on GitHub
How Robust are Randomized Smoothing based Defenses to Data Poisoning? (CVPR 2021)
☆14Jul 16, 2021Updated 5 years ago
inspire-group / DP-RandP
View on GitHub
[NeurIPS 2023] Differentially Private Image Classification by Learning Priors from Random Processes
☆12Jun 12, 2023Updated 3 years ago
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
LLLeoLi / LARF
View on GitHub
[EMNLP 2025] Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
☆15Jul 22, 2025Updated last year
EnnengYang / RepresentationSurgery
View on GitHub
Representation Surgery for Multi-Task Model Merging. ICML, 2024.
☆49Oct 10, 2024Updated last year
Lslland / T-Vaccine
View on GitHub
☆19Jun 21, 2025Updated last year
IBM / AutoVP
View on GitHub
[ICLR24] "AutoVP: An Automated Visual Prompting Framework and Benchmark" by Hsi-Ai Tsao*, Lei Hsiung*, Pin-Yu Chen, Sijia Liu, and Tsung-…
☆23Sep 18, 2025Updated 10 months ago
Unispac / Circumventing-Backdoor-Defenses
View on GitHub
Code Repository for the Paper ---Revisiting the Assumption of Latent Separability for Backdoor Defenses (ICLR 2023)
☆47Feb 28, 2023Updated 3 years ago
SchwinnL / circuit-breakers-eval
View on GitHub
Independent robustness evaluation of Improving Alignment and Robustness with Short Circuiting
☆18Apr 15, 2025Updated last year
git-disl / Lisa
View on GitHub
This is the official code for the paper "Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning" (NeurIPS2024)
☆29Sep 10, 2024Updated last year
scaleapi / mrt
View on GitHub
https://scale.com/research/mrt
☆20Mar 16, 2026Updated 4 months ago
safety-research / finetuning-auditor
View on GitHub
Auditing agents for fine-tuning safety
☆21Oct 21, 2025Updated 9 months ago
Managed Kubernetes at scale on DigitalOcean • Ad
DigitalOcean Kubernetes includes the control plane, bandwidth allowance, container registry, automatic updates, and more for free.
IBM / composite-adv
View on GitHub
[CVPR23] "Towards Compositional Adversarial Robustness: Generalizing Adversarial Training to Composite Semantic Perturbations" by Lei Hsi…
☆23Sep 17, 2025Updated 10 months ago
Unispac / shallow-vs-deep-alignment
View on GitHub
Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep
☆190Apr 23, 2025Updated last year
MadryLab / D3M
View on GitHub
Debiasing Through Data Attribution
☆13May 23, 2024Updated 2 years ago
ethz-spylab / unlearning-vs-safety
View on GitHub
☆27Oct 6, 2024Updated last year
Hazelsuko07 / InstaHide_Challenge
View on GitHub
A challenge to investigate the security of the InstaHide protocol.
☆12Dec 7, 2020Updated 5 years ago
Unispac / Fight-Poison-With-Poison
View on GitHub
Code repository for the paper --- [USENIX Security 2023] Towards A Proactive ML Approach for Detecting Backdoor Poison Samples
☆31Jul 11, 2023Updated 3 years ago
andyzoujm / breaking-llama-guard
View on GitHub
Code to break Llama Guard
☆32Dec 7, 2023Updated 2 years ago
GraySwanAI / circuit-breakers
View on GitHub
Improving Alignment and Robustness with Circuit Breakers
☆266Sep 24, 2024Updated last year
git-disl / Virus
View on GitHub
This is the official code for the paper "Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation"
☆56Feb 2, 2025Updated last year
1-Click AI Models by DigitalOcean Gradient • Ad
Deploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click. Zero configuration with optimized deployments.
git-disl / Safety-Tax
View on GitHub
This is the official code for the paper "Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable".
☆35Mar 11, 2025Updated last year
xyq7 / Human-Contribution-Measurement
View on GitHub
☆13Jun 4, 2025Updated last year
Hazelsuko07 / EMA
View on GitHub
☆16Sep 8, 2021Updated 4 years ago
RUCAIBox / FIGA
View on GitHub
[ICLR 2024] This is the official implementation for the paper: "Beyond imitation: Leveraging fine-grained quality signals for alignment"
☆10May 5, 2024Updated 2 years ago
poloclub / llm-landscape
View on GitHub
NeurIPS'24 - LLM Safety Landscape
☆40Oct 21, 2025Updated 9 months ago
AI-secure / MMDT
View on GitHub
Comprehensive Assessment of Trustworthiness in Multimodal Foundation Models
☆29Mar 15, 2025Updated last year
zhuotongchen / Towards-Robust-Neural-Networks-via-Close-loop-Control
View on GitHub
☆13Jan 30, 2021Updated 5 years ago