THU-KEG/SafetyNeuron

Data and code for the paper: Finding Safety Neurons in Large Language Models

☆29

Alternatives and similar repositories for SafetyNeuron

Users who are interested in SafetyNeuron compare it to the repositories listed below.

zhaoyiran924 / Safety-Neuron
[ICLR 2025] Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron
☆33 · Apr 30, 2025 · Updated last year
Lexsi-Labs / aligntune
Aligntune: A Modular Toolkit for Post-Training Alignment of LLMs
☆37 · Updated this week
wbopan / safety-residual-space
Multi-dimensional analysis of orthogonal safety directions in LLM alignment
☆22 · Jun 12, 2026 · Updated last month
thu-coai / TransferAttack
[ACL 2025] Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints
☆19 · May 23, 2025 · Updated last year
homles11 / SaLoRA
Code for “SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation” (ICLR 2025)
☆29 · Oct 23, 2025 · Updated 9 months ago
chenjianhuii / Mechanistic-Data-Attribution
☆16 · May 25, 2026 · Updated last month
xypan0 / G-DIG
☆12 · Jun 30, 2024 · Updated 2 years ago
ydyjya / SafetyHeadAttribution
☆70 · Jun 1, 2025 · Updated last year
zhaoshiji123 / SI-Attack
Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency
☆16 · Aug 6, 2025 · Updated 11 months ago
IINemo / llm-uncertainty-head
☆26 · Feb 23, 2026 · Updated 5 months ago
ZJU-LLM-Safety / MAJIC-AAAI2026
[AAAI-2026] MAJIC: Markovian Adaptive Jailbreaking. An automated black-box attack framework against LLMs that iteratively selects and fuse…
☆16 · Apr 1, 2026 · Updated 3 months ago
TrustAIRLab / HarmfulSkillBench
The Official Repository for Paper "HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?"
☆15 · May 2, 2026 · Updated 2 months ago
listen0425 / Safety-Layers
Code for the paper "Safety Layers in Aligned Large Language Models: The Key to LLM Security" (ICLR 2025)
☆25 · Apr 26, 2025 · Updated last year
THU-KEG / DICE
DICE: Detecting In-distribution Data Contamination with LLM's Internal State
☆12 · Sep 21, 2024 · Updated last year
wollschlager / geometry-of-refusal
Code for the paper: The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
☆35 · Jul 31, 2025 · Updated 11 months ago
javiferran / sae_entities
☆78 · Mar 6, 2025 · Updated last year
wangyu-ovo / MML
Code for the paper "Jailbreak Large Vision-Language Models Through Multi-Modal Linkage"
☆35 · Dec 6, 2024 · Updated last year
wu-lichao / NeuroStrike-Neuron-Level-Attacks-on-Aligned-LLMs
☆17 · Jan 9, 2026 · Updated 6 months ago
HanNight / AdaCAD
Code for NAACL 2025 paper "AdaCAD: Adaptively Decoding to Balance Conflicts between Contextual and Parametric Knowledge"
☆16 · Mar 2, 2026 · Updated 4 months ago
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆179 · Sep 18, 2025 · Updated 10 months ago
CHATS-lab / LLMs_Encode_Harmfulness_Refusal_Separately
☆41 · Jul 3, 2026 · Updated 2 weeks ago
arumaekawa / DiLM
Implementation of "DiLM: Distilling Dataset into Language Model for Text-level Dataset Distillation" (accepted by NAACL 2024 Findings).
☆28 · Feb 10, 2025 · Updated last year
XuandongZhao / pf-decoding
[ICLR 2025] Permute-and-Flip: An optimally robust and watermarkable decoder for LLMs
☆19 · Mar 20, 2025 · Updated last year
DyMessi / VisCRA
☆19 · Dec 23, 2025 · Updated 7 months ago
CHATS-lab / ToolShield
[ICML 2026] Official implementation for paper "Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Ag…
☆29 · Jul 6, 2026 · Updated 2 weeks ago
liutianlin0121 / decoding-time-realignment
Implementation of "Decoding-time Realignment of Language Models", ICML 2024.
☆21 · Jun 17, 2024 · Updated 2 years ago
kyleliang919 / Online-Subspace-Descent
[NeurIPS 2024] Low rank memory efficient optimizer without SVD
☆33 · Jul 1, 2025 · Updated last year
TeamPigeonLab / CS-DJ
Accepted by CVPR 2025 (highlight)
☆25 · Jun 8, 2025 · Updated last year
JingWu321 / EraseDiff
EraseDiff: Erasing Data Influence in Diffusion Models
☆14 · Nov 20, 2024 · Updated last year
howjul / note
howjul's notebook
☆14 · Nov 15, 2024 · Updated last year
boyellow / AdaAD
Code for the paper Boosting Accuracy and Robustness of Student Models via Adaptive Adversarial Distillation (CVPR 2023).
☆34 · May 26, 2023 · Updated 3 years ago
Dtc7w3PQ / Visco-Attack
Official implementation of Visco-Attack (EMNLP 2025 Main). An open-source one-click reproduction script is also provided.
☆31 · Apr 11, 2026 · Updated 3 months ago
duykhuongnguyen / MAT-Steer
☆21 · Aug 19, 2025 · Updated 11 months ago
Mr-OREO / CourseworkOfSE
A record of two years of course materials from a software engineering program at a third-tier college. Onward, young one!
☆10 · Dec 10, 2022 · Updated 3 years ago
fmp453 / few-shot-erasing
[BMVC2024] Erasing Concepts from Text-to-Image Diffusion Models with Few-shot Unlearning
☆14 · Updated this week
HKUST-KnowComp / LLM-Multistep-Jailbreak
Code for Findings-EMNLP 2023 paper: Multi-step Jailbreaking Privacy Attacks on ChatGPT
☆37 · Oct 15, 2023 · Updated 2 years ago
NeuralSentinel / SafeInfer
☆23 · Jan 14, 2025 · Updated last year
qizhangli / Gradient-based-Jailbreak-Attacks
Code for our NeurIPS 2024 paper Improved Generation of Adversarial Examples Against Safety-aligned LLMs
☆12 · Nov 7, 2024 · Updated last year
THU-KEG / Xlore2.0
Xlore2.0 code [BaiduExtractor, HudongExtractor, WikiExtractor, XloreData, XloreWeb]
☆12 · Apr 5, 2017 · Updated 9 years ago