wbopan/safety-residual-space

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/wbopan/safety-residual-space)

wbopan / safety-residual-space

Multi-dimensional analysis of orthogonal safety directions in LLM alignment

☆23

Alternatives and similar repositories for safety-residual-space

Users that are interested in safety-residual-space are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

RPC2 / AutoInject
View on GitHub
☆21Jun 12, 2026Updated last month
JoshEngels / SAE-Dark-Matter
View on GitHub
Code for our paper "Decomposing The Dark Matter of Sparse Autoencoders"
☆23Feb 6, 2025Updated last year
SproutNan / AI-Safety_SCAV
View on GitHub
This is the code repository for "Uncovering Safety Risks of Large Language Models through Concept Activation Vector"
☆49Oct 13, 2025Updated 9 months ago
CGCL-codes / Gen-AF
View on GitHub
The implementation of our IEEE S&P 2024 paper "Securely Fine-tuning Pre-trained Encoders Against Adversarial Examples".
☆11Jun 28, 2024Updated 2 years ago
Zhang-Yihao / Adversarial-Representation-Engineering
View on GitHub
Official implementation repository for the paper Towards General Conceptual Model Editing via Adversarial Representation Engineering.
☆20Dec 6, 2024Updated last year
Managed Database hosting by DigitalOcean • Ad
PostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
euanong / image-hijacks
View on GitHub
Official codebase for Image Hijacks: Adversarial Images can Control Generative Models at Runtime
☆57Sep 19, 2023Updated 2 years ago
THU-KEG / SafetyNeuron
View on GitHub
Data and code for the paper: Finding Safety Neurons in Large Language Models
☆30Jan 29, 2026Updated 6 months ago
compsec-snu / pfi
View on GitHub
PFI: Prompt Flow Integrity to Prevent Privilege Escalation in LLM Agents
☆31Mar 26, 2025Updated last year
FarnoushRJ / RelP
View on GitHub
[NeurIPS 2025 MechInterp Workshop - Spotlight] Official implementation of the paper "RelP: Faithful and Efficient Circuit Discovery in La…
☆29Nov 3, 2025Updated 8 months ago
ejhshen / SLIM
View on GitHub
Implementation of SLIM, a framework of dynamics skill lifecycle management for agentic reinforcement learning
☆22May 12, 2026Updated 2 months ago
uw-nsl / SafeDecoding
View on GitHub
Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
☆154Jul 19, 2024Updated 2 years ago
RylanSchaeffer / AstraFellowship-When-Do-VLM-Image-Jailbreaks-Transfer
View on GitHub
Code for ICLR 2025 Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
☆37Jun 1, 2025Updated last year
ydyjya / SafetyHeadAttribution
View on GitHub
☆70Jun 1, 2025Updated last year
VITA-Group / R-Sparse
View on GitHub
[ICLR'25] R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference
☆21Apr 28, 2025Updated last year
Wordpress hosting with auto-scaling - Free Trial Offer • Ad
Fully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
inspire-group / tta_risk
View on GitHub
☆15Jun 6, 2023Updated 3 years ago
CGCL-codes / TransferAttackSurrogates
View on GitHub
The official code of IEEE S&P 2024 paper "Why Does Little Robustness Help? A Further Step Towards Understanding Adversarial Transferabili…
☆20Aug 22, 2024Updated last year
DJC-GO-SOLO / Latent-GRPO
View on GitHub
Official implementation of Latent-GRPO: reinforcement learning for vocabulary-space latent reasoning.
☆17May 12, 2026Updated 2 months ago
RUCAIBox / HADES
View on GitHub
[ECCV'24 Oral] The official GitHub page for ''Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking …
☆39Oct 23, 2024Updated last year
uw-nsl / safechain
View on GitHub
[ACL 25] SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
☆30Apr 2, 2025Updated last year
jiah-li / magic
View on GitHub
The repo for paper: Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models.
☆15Dec 16, 2024Updated last year
Confirm-Solutions / flrt
View on GitHub
Fluent student-teacher redteaming
☆23Jul 25, 2024Updated 2 years ago
NY1024 / Jailbreak_GPT4o
View on GitHub
☆28Jun 5, 2024Updated 2 years ago
DyMessi / VisCRA
View on GitHub
☆19Dec 23, 2025Updated 7 months ago
AI Agents on DigitalOcean Gradient AI Platform • Ad
Build production-ready AI agents using customizable tools or access multiple LLMs through a single endpoint. Create custom knowledge bases or connect external data.
real-absolute-AI / Unnatural_Language
View on GitHub
The official repository of 'Unnatural Language Are Not Bugs but Features for LLMs'
☆24May 20, 2025Updated last year
qizhangli / MoreBayesian-attack
View on GitHub
Code for our ICLR 2023 paper Making Substitute Models More Bayesian Can Enhance Transferability of Adversarial Examples.
☆18May 31, 2023Updated 3 years ago
Xu0615 / FinetuneCircuits
View on GitHub
A Mechanistic‑Interpretability study that finds the structural dynamics of Large Language Models under fine‑tuning.
☆17May 30, 2025Updated last year
NY1024 / BAP-Jailbreak-Vision-Language-Models-via-Bi-Modal-Adversarial-Prompt
View on GitHub
☆61Jun 5, 2024Updated 2 years ago
OSU-NLP-Group / AgentAttack
View on GitHub
☆22Oct 25, 2024Updated last year
wollschlager / geometry-of-refusal
View on GitHub
Code to the paper: The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
☆35Jul 31, 2025Updated 11 months ago
ShaoShuai0605 / Misevolution
View on GitHub
Official Repo of Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents
☆90Jun 2, 2026Updated last month
thu-coai / SafeUnlearning
View on GitHub
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
☆32Jul 9, 2024Updated 2 years ago
baixianghuang / editing-attack
View on GitHub
Code and dataset for the paper: "Can Editing LLMs Inject Harm?" [AAAI'26]
☆21Dec 26, 2025Updated 7 months ago
Managed Kubernetes at scale on DigitalOcean • Ad
DigitalOcean Kubernetes includes the control plane, bandwidth allowance, container registry, automatic updates, and more for free.
LLM-MI-Research / Actionable-MI
View on GitHub
☆15Jan 20, 2026Updated 6 months ago
Tomsawyerhu / LRP4RAG
View on GitHub
RAG Hallucination Detecting By LRP.
☆12Mar 31, 2025Updated last year
wbopan / flashtrace
View on GitHub
Efficient multi-token attribution for reasoning language models — Python package, CLI, and HTML token traces
☆33Updated this week
wbbeyourself / DTE
View on GitHub
Detect-Then-Explain Framework for Text-to-SQL task
☆10Dec 6, 2023Updated 2 years ago
AI45Lab / skill-safety-bench
View on GitHub
☆29May 14, 2026Updated 2 months ago
AI45Lab / X-Boundary
View on GitHub
[EMNLP 2025] The code repo of paper "X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Com…
☆40Nov 24, 2025Updated 8 months ago
Unispac / Visual-Adversarial-Examples-Jailbreak-Large-Language-Models
View on GitHub
Repository for the Paper (AAAI 2024, Oral) --- Visual Adversarial Examples Jailbreak Large Language Models
☆282May 13, 2024Updated 2 years ago