aengusl/latent-adversarial-training

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/aengusl/latent-adversarial-training)

aengusl / latent-adversarial-training

☆48

Alternatives and similar repositories for latent-adversarial-training

Users that are interested in latent-adversarial-training are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

thestephencasper / latent_adversarial_training
View on GitHub
☆24Jul 25, 2024Updated 2 years ago
Jayfeather1024 / Backdoor-Enhanced-Alignment
View on GitHub
☆24Dec 8, 2024Updated last year
Lslland / T-Vaccine
View on GitHub
☆19Jun 21, 2025Updated last year
milesaturpin / cot-unfaithfulness
View on GitHub
☆57Oct 23, 2023Updated 2 years ago
ejnnr / cupbearer
View on GitHub
A library for mechanistic anomaly detection
☆22Jan 9, 2025Updated last year
Proton VPN Special Offer - Get 70% off • Ad
Special partner offer. Trusted by over 100 million users worldwide. Tested, Approved and Recommended by Experts.
GraySwanAI / circuit-breakers
View on GitHub
Improving Alignment and Robustness with Circuit Breakers
☆266Sep 24, 2024Updated last year
reds-lab / BEEAR
View on GitHub
This is the official Gtihub repo for our paper: "BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Lang…
☆23Jul 3, 2024Updated 2 years ago
princeton-polaris-lab / Evaluating-Durable-Safeguards
View on GitHub
[ICLR 2025] On Evluating the Durability of Safegurads for Open-Weight LLMs
☆13Jun 20, 2025Updated last year
CryptoAILab / misalignment
View on GitHub
[NDSS'25] The official implementation of safety misalignment.
☆19Jan 8, 2025Updated last year
OODRobustBench / OODRobustBench
View on GitHub
OODRobustBench: a Benchmark and Large-Scale Analysis of Adversarial Robustness under Distribution Shift. ICML 2024 and ICLRW-DMLR 2024
☆23Jul 25, 2024Updated 2 years ago
conditionWang / Data_Centric_AI_IP_Protection
View on GitHub
This is the repository that introduces research topics related to protecting intellectual property (IP) of AI from a data-centric perspec…
☆23Oct 30, 2023Updated 2 years ago
inspire-group / RobustRAG
View on GitHub
☆31Sep 15, 2024Updated last year
SORRY-Bench / sorry-bench
View on GitHub
Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)
☆83Mar 1, 2025Updated last year
rmin2000 / adv_tracing
View on GitHub
Identification of the Adversary from a Single Adversarial Example (ICML 2023)
☆10Jul 15, 2024Updated 2 years ago
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
rishub-tamirisa / tamper-resistance
View on GitHub
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
☆68Jun 9, 2025Updated last year
locuslab / breaking-poisoned-classifier
View on GitHub
Code for paper "Poisoned classifiers are not only backdoored, they are fundamentally broken"
☆26Jan 7, 2022Updated 4 years ago
Gwinhen / MOTH
View on GitHub
This is the implementation for IEEE S&P 2022 paper "Model Orthogonalization: Class Distance Hardening in Neural Networks for Better Secur…
☆11Aug 24, 2022Updated 3 years ago
Unispac / Circumventing-Backdoor-Defenses
View on GitHub
Code Repository for the Paper ---Revisiting the Assumption of Latent Separability for Backdoor Defenses (ICLR 2023)
☆47Feb 28, 2023Updated 3 years ago
git-disl / Vaccine
View on GitHub
This is the official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS2024)
☆51Jan 15, 2026Updated 6 months ago
w-yibo / R1-Compress
View on GitHub
[NeurIPS 2025@FoRLM] R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search
☆17Jan 24, 2026Updated 6 months ago
LukeBailey181 / obfuscated-activations
View on GitHub
Codebase for Obfuscated Activations Bypass LLM Latent-Space Defenses
☆31Feb 11, 2025Updated last year
MaheepChaudhary / SAE-Ravel
View on GitHub
Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…
☆13Jan 26, 2025Updated last year
AI-secure / Knowledge-Enhanced-Machine-Learning-Pipeline
View on GitHub
Repository for Knowledge Enhanced Machine Learning Pipeline (KEMLP)
☆10Jun 5, 2021Updated 5 years ago
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
ethz-spylab / realistic-adv-examples
View on GitHub
Code for the paper "Evading Black-box Classifiers Without Breaking Eggs" [SaTML 2024]
☆21Apr 15, 2024Updated 2 years ago
zhangrui4041 / Instruction_Backdoor_Attack
View on GitHub
☆25Aug 21, 2024Updated last year
AISafety-HKUST / Backdoor_Safety_Tuning
View on GitHub
Backdoor Safety Tuning (NeurIPS 2023 & 2024 Spotlight)
☆27Nov 18, 2024Updated last year
hammlab / PoisoningCertifiedDefenses
View on GitHub
How Robust are Randomized Smoothing based Defenses to Data Poisoning? (CVPR 2021)
☆14Jul 16, 2021Updated 5 years ago
wonderNefelibata / Awesome-LRM-Safety
View on GitHub
Awesome Large Reasoning Model(LRM) Safety.This repository is used to collect security-related research on large reasoning models such as …
☆84Updated this week
Butanium / tiny-activation-dashboard
View on GitHub
A tiny easily hackable implementation of a feature dashboard.
☆17Oct 21, 2025Updated 9 months ago
ejones313 / auditing-llms
View on GitHub
☆61Mar 9, 2023Updated 3 years ago
EleutherAI / deep-ignorance
View on GitHub
☆20Jan 7, 2026Updated 6 months ago
Unispac / shallow-vs-deep-alignment
View on GitHub
Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep
☆190Apr 23, 2025Updated last year
GPU virtual machines on DigitalOcean Gradient AI • Ad
Get to production fast with high-performance AMD and NVIDIA GPUs you can spin up in seconds. The definition of operational simplicity.
TDteach / Demon-in-the-Variant
View on GitHub
☆13Oct 21, 2021Updated 4 years ago
facebookresearch / jailbreak-objectives
View on GitHub
Code and data to go with the Zhu et al. paper "An Objective for Nuanced LLM Jailbreaks"
☆37Jul 2, 2026Updated 3 weeks ago
centerforaisafety / tdc2023-starter-kit
View on GitHub
This is the starter kit for the Trojan Detection Challenge 2023 (LLM Edition), a NeurIPS 2023 competition.
☆92May 19, 2024Updated 2 years ago
real-absolute-AI / Unnatural_Language
View on GitHub
The official repository of 'Unnatural Language Are Not Bugs but Features for LLMs'
☆24May 20, 2025Updated last year
safety-research / open-source-alignment-faking
View on GitHub
Open Source Replication of Anthropic's Alignment Faking Paper
☆58Apr 4, 2025Updated last year
ethz-spylab / unlearning-vs-safety
View on GitHub
☆27Oct 6, 2024Updated last year
inspire-group / tta_risk
View on GitHub
☆15Jun 6, 2023Updated 3 years ago