wicai24/DOOR-Alignment

☆20

Alternatives and similar repositories for DOOR-Alignment

Users who are interested in DOOR-Alignment are comparing it to the repositories listed below.

UCSB-AI / MSSBench
[ICLR 2025] Official codebase for the ICLR 2025 paper "Multimodal Situational Safety"
☆36 · Jun 23, 2025 · Updated last year
yuki-younai / MTSA
Official implementation of MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming
☆16 · Jun 2, 2025 · Updated last year
yuchenlwu / PersonalizedSafety
[NeurIPS 2025]: Personalized Safety in LLMs — A Benchmark and a Planning-Based Agent Approach
☆16 · Oct 30, 2025 · Updated 8 months ago
InvokerStark / OverKill
☆15 · Jun 13, 2024 · Updated 2 years ago
jpzhang1810 / LDM-Robustness
PyTorch implementation for the pilot study on the robustness of latent diffusion models.
☆13 · Jun 20, 2023 · Updated 3 years ago
Jayfeather1024 / Backdoor-Enhanced-Alignment
☆24 · Dec 8, 2024 · Updated last year
facebookresearch / multimodal-fusion-jailbreaks
Official repository for the paper "Gradient-based Jailbreak Images for Multimodal Fusion Models" (https://arxiv.org/abs/2410.03489)
☆20 · Oct 22, 2024 · Updated last year
AI45Lab / CodeAttack
[ACL 2024] CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
☆61 · Oct 1, 2025 · Updated 9 months ago
Jinxiaolong1129 / Foot-in-the-door-Jailbreak
☆23 · May 14, 2025 · Updated last year
AlignmentResearch / scaling-poisoning
☆17 · Nov 18, 2024 · Updated last year
AI45Lab / ActorAttack
☆135 · Jun 29, 2026 · Updated 3 weeks ago
SaFo-Lab / ReasoningBomb
[CCS 2026] The official implementation of our CCS 2026 paper "ReasoningBomb: A Stealthy Denial-of-Service Attack by Inducing Pathological…
☆15 · Jun 24, 2026 · Updated 3 weeks ago
chawins / adv-part-model
Code for a research paper "Part-Based Models Improve Adversarial Robustness" (ICLR 2023)
☆21 · Sep 16, 2023 · Updated 2 years ago
weiyezhimeng / SQL-Injection-Jailbreak
☆22 · Jul 26, 2025 · Updated 11 months ago
AI-secure / MMDT
Comprehensive Assessment of Trustworthiness in Multimodal Foundation Models
☆29 · Mar 15, 2025 · Updated last year
LiuAmber / RAHF
[ACL 2024 main] Aligning Large Language Models with Human Preferences through Representation Engineering (https://aclanthology.org/2024.…
☆28 · Sep 25, 2024 · Updated last year
CryptoAILab / FigStep
[AAAI'25 (Oral)] Jailbreaking Large Vision-language Models via Typographic Visual Prompts
☆211 · Jun 26, 2025 · Updated last year
HanjiangHu / NBF-LLM
The official code for "Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks".
☆18 · Jun 24, 2026 · Updated 3 weeks ago
ethz-spylab / autoadvexbench
☆42 · May 21, 2025 · Updated last year
Hongcheng-Gao / HAVEN
Code and data for paper "Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation".
☆25 · Oct 22, 2025 · Updated 9 months ago
ShenzheZhu / JailDAM
[COLM 2025] JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model
☆26 · Nov 25, 2025 · Updated 7 months ago
6zHAOyi / BadVision
This is an official code repository for CVPR 2025 paper BadVision.
☆15 · Nov 18, 2025 · Updated 8 months ago
lasr-spelling / sae-spelling
Code for the paper "A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders"
☆15 · Dec 28, 2025 · Updated 6 months ago
MurrayTom / SG-Bench
SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types
☆26 · Nov 29, 2024 · Updated last year
JasonGross / guarantees-based-mechanistic-interpretability
☆18 · Updated this week
thu-coai / AISafetyLab
AISafetyLab: A comprehensive framework covering safety attack, defense, evaluation and paper list.
☆248 · Apr 21, 2026 · Updated 3 months ago
apple / ml-vfm-kt
☆14 · Jul 2, 2024 · Updated 2 years ago
scaleapi / mrt
https://scale.com/research/mrt
☆20 · Mar 16, 2026 · Updated 4 months ago
thu-ml / STAIR
Official codebase for "STAIR: Improving Safety Alignment with Introspective Reasoning"
☆89 · Feb 26, 2025 · Updated last year
TeunvdWeij / sandbagging
☆20 · Nov 15, 2024 · Updated last year
tmlr-group / BayesianLM
[NeurIPS 2024 Oral] "Bayesian-Guided Label Mapping for Visual Reprogramming"
☆12 · Dec 20, 2024 · Updated last year
tmllab / 2025_ICLR_PiF
☆40 · May 17, 2025 · Updated last year
centerforaisafety / HarmBench
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
☆1,011 · Aug 16, 2024 · Updated last year
choidami / inductive-oocr
☆16 · Mar 22, 2025 · Updated last year
pralab / transfer-bench
Repository for the evaluation of the transferability of adversarial examples.
☆21 · Nov 25, 2025 · Updated 7 months ago
zhu-minjun / SafetyLock
Your finetuned model's back to its original safety standards faster than you can say "SafetyLock"!
☆11 · Oct 16, 2024 · Updated last year
hwanchang00 / ChatInject
[ICLR 2026] Official implementation of "ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents"
☆17 · Mar 23, 2026 · Updated 4 months ago
leopoldwhite / Awesome-Inference-Time-Trustworthiness
☆15 · May 15, 2026 · Updated 2 months ago
PeterWang512 / AttributeByUnlearning
Code for the paper "Data Attribution for Text-to-Image Models by Unlearning Synthesized Images."
☆17 · May 23, 2025 · Updated last year