zhaoyiran924/Safety-Neuron

[ICLR 2025] Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron

☆33

Alternatives and similar repositories for Safety-Neuron

Users who are interested in Safety-Neuron are comparing it to the libraries listed below.

chenyuxin1999 / Abstract_Thought
[NeurIPS 2025] The implementation of paper "The Emergence of Abstract Thought in Large Language Models Beyond Any Language"
☆19 · Jun 9, 2025 · Updated last year
bpwu1 / confidence-regulation-neurons
Confidence Regulation Neurons in Language Models (NeurIPS 2024)
☆15 · Feb 1, 2025 · Updated last year
CHATS-lab / LLMs_Encode_Harmfulness_Refusal_Separately
☆41 · Jul 3, 2026 · Updated 3 weeks ago
DAMO-NLP-SG / multilingual_analysis
[NeurIPS 2024] How do Large Language Models Handle Multilingualism?
☆52 · Nov 8, 2024 · Updated last year
zepingyu0512 / awesome-LLM-neuron
☆36 · Jun 13, 2025 · Updated last year
aladinD / SafeMERGE
Code for SafeMERGE (ICLR 2025).
☆15 · Apr 1, 2025 · Updated last year
THU-KEG / SafetyNeuron
Data and code for the paper: Finding Safety Neurons in Large Language Models
☆30 · Jan 29, 2026 · Updated 6 months ago
taidopurason / tokenizer-extension
☆15 · Dec 4, 2025 · Updated 7 months ago
real-absolute-AI / Unnatural_Language
The official repository of 'Unnatural Language Are Not Bugs but Features for LLMs'
☆24 · May 20, 2025 · Updated last year
Les1a / SoftTokenForMaskedDLM
Introduce a continuous intermediate representation between "masks" and "tokens" for dLLM
☆15 · Dec 1, 2025 · Updated 7 months ago
togethercomputer / xorl
XoRL
☆17 · Updated this week
ColinLu50 / SafeDelta
The official code repo for "Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets" in ICML 2025.
☆59 · Feb 12, 2026 · Updated 5 months ago
thu-coai / TransferAttack
[ACL 2025] Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints
☆19 · May 23, 2025 · Updated last year
ASTRAL-Group / MonitorBench
[COLM 2026] Official implementation for "MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Mo…
☆20 · Apr 23, 2026 · Updated 3 months ago
javiferran / sae_entities
☆78 · Mar 6, 2025 · Updated last year
brendel-group / clip-ood
Official code for the paper "Does CLIP's Generalization Performance Mainly Stem from High Train-Test Similarity?" (ICLR 2024)
☆11 · Aug 26, 2024 · Updated last year
apartresearch / DarkBench
Benchmarking Dark Patterns in LLMs (ICLR 2025)
☆18 · Mar 29, 2025 · Updated last year
FarinaMatteo / qmmf
[CVPR '23 Highlight] Official repository for the paper "Quantum Multi-Model Fitting".
☆11 · Mar 7, 2025 · Updated last year
wzhuang-xmu / LoSA
[ICLR 2025] Official implementation of paper "Dynamic Low-Rank Sparse Adaptation for Large Language Models".
☆25 · Mar 16, 2025 · Updated last year
GATECH-EIC / NeRFool
[ICML 2023] "NeRFool: Uncovering the Vulnerability of Generalizable Neural Radiance Fields against Adversarial Perturbations" by Yonggan …
☆19 · Mar 10, 2024 · Updated 2 years ago
boyiwei / alignment-attribution-code
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
☆91 · Mar 30, 2025 · Updated last year
thunxxx / MLLM-Jailbreak-evaluation-MMJ-Bench
☆80 · Mar 30, 2025 · Updated last year
ydyjya / SafetyHeadAttribution
☆70 · Jun 1, 2025 · Updated last year
yihuaihong / ConceptVectors
[EMNLP 2025 Main] ConceptVectors Benchmark and Code for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces"
☆40 · Aug 20, 2025 · Updated 11 months ago
DAMO-NLP-SG / AdamergeX
☆11 · Apr 2, 2024 · Updated 2 years ago
Astarojth / AgentAuditor-ASSEBench
☆40 · May 29, 2026 · Updated 2 months ago
CHATS-lab / ToolShield
[ICML 2026] Official implementation for paper "Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Ag…
☆29 · Jul 6, 2026 · Updated 3 weeks ago
andyrdt / refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆424 · Jun 13, 2025 · Updated last year
biomed-AI / LMetalSite
LMetalSite: alignment-free metal ion-binding site prediction from protein sequence through pretrained language model and multi-task learn…
☆19 · Oct 23, 2025 · Updated 9 months ago
QingyuLiu / Agentic-Upward-Deception
This repo is the official implementation of “Are Your Agents Upward Deceivers?”. The paper is accepted by ICML 2026.
☆24 · Dec 15, 2025 · Updated 7 months ago
med-air / ClipGS
[MICCAI'25] ClipGS: Clippable Gaussian Splatting for Interactive Cinematic Visualization of Volumetric Medical Data
☆15 · Jul 28, 2025 · Updated last year
czg1225 / VeriThinker
[NeurIPS 2025] VeriThinker: Learning to Verify Makes Reasoning Model Efficient
☆67 · Sep 27, 2025 · Updated 10 months ago
ASTRAL-Group / ASTRA
[CVPR 2025] Official implementation for "Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbre…
☆62 · Jul 5, 2025 · Updated last year
tdemin16 / multi-lane
Official Implementation of MULTI-LANE (Multi Label class incremental learning via summarising pAtch tokeN Embeddings). Published in 3rd C…
☆15 · Feb 20, 2025 · Updated last year
MingyuJ666 / Rope_with_LLM
[ICML'25] Our study systematically investigates massive values in LLMs' attention mechanisms. First, we observe massive values are concen…
☆87 · Jun 20, 2025 · Updated last year
nji3 / PCA_Autoencoder_FisherFace
Using PCA, Autoencoder and Fisher linear discriminant to extract the effective representations from the face images. Do the reconstructio…
☆12 · Apr 23, 2019 · Updated 7 years ago
mainlp / Multilingual-Refusal
☆16 · Nov 5, 2025 · Updated 8 months ago
isXinLiu / MM-SafetyBench
Accepted by ECCV 2024
☆218 · Oct 15, 2024 · Updated last year
noagarcia / phase
PHASE annotations for societal bias in vision-and-language tasks.
☆18 · Jun 18, 2024 · Updated 2 years ago