xyq7/GradSafe

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/xyq7/GradSafe)

xyq7 / GradSafe

Official Code for ACL 2024 paper "GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis"

☆68

Alternatives and similar repositories for GradSafe

Users that are interested in GradSafe are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

xyq7 / Human-Contribution-Measurement
View on GitHub
☆13Jun 4, 2025Updated last year
W-Ted / UDC-NeRF
View on GitHub
Official code for ICCV2023 paper: Learning Unified Decompositional and Compositional NeRF for Editable Novel View Synthesis
☆34Dec 27, 2023Updated 2 years ago
uw-nsl / SafeDecoding
View on GitHub
Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
☆154Jul 19, 2024Updated 2 years ago
thu-coai / SafeUnlearning
View on GitHub
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
☆32Jul 9, 2024Updated 2 years ago
pipilurj / bootstrapped-preference-optimization-BPO
View on GitHub
code for "Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization"
☆63Aug 23, 2024Updated last year
GPU virtual machines on DigitalOcean Gradient AI • Ad
Get to production fast with high-performance AMD and NVIDIA GPUs you can spin up in seconds. The definition of operational simplicity.
rt219 / LatentGuard
View on GitHub
This is the official repo of the paper "Latent Guard: a Safety Framework for Text-to-image Generation"
☆54Oct 24, 2024Updated last year
pipilurj / MLLM-protector
View on GitHub
The official repository for paper "MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance"
☆46Apr 21, 2024Updated 2 years ago
sterzhang / PVIT
View on GitHub
Official Repository of Personalized Visual Instruct Tuning
☆34Mar 6, 2025Updated last year
prismformore / SDSEN
View on GitHub
☆20May 26, 2020Updated 6 years ago
LLLeoLi / LARF
View on GitHub
[EMNLP 2025] Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
☆15Jul 22, 2025Updated 11 months ago
pipilurj / ROBOT
View on GitHub
☆27Apr 11, 2023Updated 3 years ago
ydyjya / LLM-IHS-Explanation
View on GitHub
☆60Jun 13, 2024Updated 2 years ago
RU-System-Software-and-Security / NONE
View on GitHub
☆10Oct 31, 2022Updated 3 years ago
UCSB-NLP-Chang / SemanticSmooth
View on GitHub
Implementation of paper 'Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing'
☆24Jun 9, 2024Updated 2 years ago
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
eurekayuan / RigorLLM
View on GitHub
Implementation for "RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content"
☆24Jul 28, 2024Updated last year
AI45Lab / ActorAttack
View on GitHub
☆135Jun 29, 2026Updated 3 weeks ago
idrl-lab / AAAD
View on GitHub
☆16Apr 23, 2025Updated last year
shiningrain / JailGuard
View on GitHub
☆32Mar 16, 2025Updated last year
chujiezheng / LLM-Safeguard
View on GitHub
Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models"
☆108May 20, 2025Updated last year
Princeton-SysML / Jailbreak_LLM
View on GitHub
☆203Nov 26, 2023Updated 2 years ago
uw-nsl / ArtPrompt
View on GitHub
[ACL24] Official Repo of Paper `ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs`
☆102Aug 15, 2025Updated 11 months ago
Zhang-Yihao / Adversarial-Representation-Engineering
View on GitHub
Official implementation repository for the paper Towards General Conceptual Model Editing via Adversarial Representation Engineering.
☆20Dec 6, 2024Updated last year
prismformore / DiffusionMTL
View on GitHub
Code of our CVPR2024 paper - DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data
☆60Mar 25, 2024Updated 2 years ago
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
W-Ted / GScream
View on GitHub
Official code for ECCV2024 paper: GScream: Learning 3D Geometry and Feature Consistent Gaussian Splatting for Object Removal
☆104Nov 25, 2025Updated 7 months ago
y4ney / LLM-Security
View on GitHub
LLM 安全资料收集与学习
☆25Jun 28, 2024Updated 2 years ago
DYR1 / MoGU
View on GitHub
Our research proposes a novel MoGU framework that improves LLMs' safety while preserving their usability.
☆18Jan 14, 2025Updated last year
prismformore / expAT
View on GitHub
TIP: Bi-directional Exponential Angular Triplet Loss for RGB-Infrared Person Re-Identification
☆21Mar 29, 2021Updated 5 years ago
shaoshuo-ss / Awesome-LLM-Fingerprinting
View on GitHub
Paper list of LLM fingerprinting, based on our paper titled "SoK: Large Language Model Copyright Auditing via Fingerprinting".
☆29Aug 28, 2025Updated 10 months ago
pipilurj / G-LLaVA
View on GitHub
Official github repo of G-LLaVA
☆154Feb 20, 2025Updated last year
yueliu1999 / FlipAttack
View on GitHub
[ICML 2025] An official source code for paper "FlipAttack: Jailbreak LLMs via Flipping".
☆178May 2, 2025Updated last year
PKU-ML / PAT
View on GitHub
Code for NeurIPS 2024 Paper "Fight Back Against Jailbreaking via Prompt Adversarial Tuning"
☆22May 6, 2025Updated last year
HKUST-KnowComp / LLM-Multistep-Jailbreak
View on GitHub
Code for Findings-EMNLP 2023 paper: Multi-step Jailbreaking Privacy Attacks on ChatGPT
☆37Oct 15, 2023Updated 2 years ago
Managed Database hosting by DigitalOcean • Ad
PostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
RobustNLP / DeRTa
View on GitHub
A novel approach to improve the safety of large language models, enabling them to transition effectively from unsafe to safe state.
☆72May 22, 2025Updated last year
limenlp / safer-instruct
View on GitHub
This is the oficial repository for "Safer-Instruct: Aligning Language Models with Automated Preference Data"
☆17Feb 22, 2024Updated 2 years ago
SORRY-Bench / sorry-bench
View on GitHub
Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)
☆83Mar 1, 2025Updated last year
paul-rottger / xstest
View on GitHub
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆138Feb 24, 2025Updated last year
princeton-nlp / benign-data-breaks-safety
View on GitHub
☆47Oct 1, 2024Updated last year
ZrW00 / MuScleLoRA
View on GitHub
The code implementation of MuScleLoRA (Accepted in ACL 2024)
☆10Dec 1, 2024Updated last year
lapisrocks / rpo
View on GitHub
Official repository for "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks"
☆62Aug 8, 2024Updated last year