CHATS-lab/LLMs_Encode_Harmfulness_Refusal_Separately

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/CHATS-lab/LLMs_Encode_Harmfulness_Refusal_Separately)

CHATS-lab / LLMs_Encode_Harmfulness_Refusal_Separately

☆41

Alternatives and similar repositories for LLMs_Encode_Harmfulness_Refusal_Separately

Users that are interested in LLMs_Encode_Harmfulness_Refusal_Separately are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

mainlp / Multilingual-Refusal
View on GitHub
☆16Nov 5, 2025Updated 8 months ago
youngwanLEE / holisafe
View on GitHub
[CVPR Findings 2026] HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model
☆17Mar 8, 2026Updated 4 months ago
ydyjya / SafetyHeadAttribution
View on GitHub
☆70Jun 1, 2025Updated last year
aladinD / SafeMERGE
View on GitHub
Code for SafeMERGE (ICLR 2025).
☆15Apr 1, 2025Updated last year
andyrdt / refusal_direction
View on GitHub
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆423Jun 13, 2025Updated last year
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
IBM / SafeLoRA
View on GitHub
Github repo for NeurIPS 2024 paper "Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models"
☆29Dec 21, 2025Updated 7 months ago
sarendis56 / Jailbreak_Detection_RCS
View on GitHub
Official Codebase of the ACL 2026 Oral paper "Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contra…
☆26Jun 25, 2026Updated 3 weeks ago
BKHMSI / cultural-trends
View on GitHub
Investigating Cultural Alignment of Large Language Models
☆13Aug 14, 2024Updated last year
ShenzheZhu / JailDAM
View on GitHub
[COLM 2025] JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model
☆26Nov 25, 2025Updated 7 months ago
CHATS-lab / ToolShield
View on GitHub
[ICML 2026] Official implementation for paper "Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Ag…
☆29Jul 6, 2026Updated 2 weeks ago
thu-coai / LongSafety
View on GitHub
[ACL 2025] LongSafety: Evaluating Long-Context Safety of Large Language Models
☆16Jun 18, 2025Updated last year
EchoSafe-MLLM / EchoSafe
View on GitHub
[CVPR 2026] Code for Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory
☆15Mar 18, 2026Updated 4 months ago
leigest519 / HiddenDetect
View on GitHub
ACL 2025 (Main) HiddenDetect: Detecting Jailbreak Attacks against Multimodal Large Language Models via Monitoring Hidden States
☆165Jun 8, 2025Updated last year
Unispac / shallow-vs-deep-alignment
View on GitHub
Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep
☆187Apr 23, 2025Updated last year
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
jorispos / ConceptorSteering
View on GitHub
☆16Mar 13, 2025Updated last year
TianyunYoung / Hallucination-Attribution
View on GitHub
This repo contains the code for the paper "Understanding and Mitigating Hallucinations in Large Vision-Language Models via Modular Attrib…
☆39Jul 14, 2025Updated last year
PandragonXIII / CIDER
View on GitHub
This is the official repository for Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models.
☆15Jan 16, 2025Updated last year
listen0425 / Safety-Layers
View on GitHub
code space of paper "Safety Layers in Aligned Large Language Models: The Key to LLM Security" (ICLR 2025)
☆25Apr 26, 2025Updated last year
bpwu1 / confidence-regulation-neurons
View on GitHub
Confidence Regulation Neurons in Language Models (NeurIPS 2024)
☆15Feb 1, 2025Updated last year
HanjiangHu / NBF-LLM
View on GitHub
The official code for "Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks".
☆18Jun 24, 2026Updated 3 weeks ago
montehoover / DynaGuard
View on GitHub
Code for "DynaGuard: A Dynamic Guardrail Model With User-Defined Policies."
☆23Nov 3, 2025Updated 8 months ago
DripNowhy / ETA
View on GitHub
[ICLR 2025] PyTorch Implementation of "ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time"
☆34Jul 20, 2025Updated last year
thunxxx / MLLM-Jailbreak-evaluation-MMJ-Bench
View on GitHub
☆81Mar 30, 2025Updated last year
Wordpress hosting with auto-scaling - Free Trial Offer • Ad
Fully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
wangyu-ovo / MML
View on GitHub
Code for the paper "Jailbreak Large Vision-Language Models Through Multi-Modal Linkage"
☆35Dec 6, 2024Updated last year
AI45Lab / VLSBench
View on GitHub
[ACL 2025] Data and Code for Paper VLSBench: Unveiling Visual Leakage in Multimodal Safety
☆62Jul 21, 2025Updated last year
wbopan / safety-residual-space
View on GitHub
Multi-dimensional analysis of orthogonal safety directions in LLM alignment
☆22Jun 12, 2026Updated last month
taidopurason / tokenizer-extension
View on GitHub
☆15Dec 4, 2025Updated 7 months ago
rgreenblatt / control-evaluations
View on GitHub
☆25May 25, 2024Updated 2 years ago
rezashkv / diffusion_pruning
View on GitHub
[ICLR 2025] Adaptive prompt tailored pruning of T2I diffusion models.
☆15Feb 1, 2025Updated last year
davidbau / sidn-handbook
View on GitHub
The Structure and Interpretation of Deep Networks Handbook
☆14Dec 14, 2024Updated last year
AI-secure / AdvAgent
View on GitHub
☆25May 28, 2025Updated last year
declare-lab / safety-arithmetic
View on GitHub
☆13Jan 14, 2025Updated last year
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
seanie12 / ThinkSafe
View on GitHub
☆19May 4, 2026Updated 2 months ago
jinghan1he / VHR
View on GitHub
[ACL 2025] Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence
☆21Jun 10, 2025Updated last year
wollschlager / geometry-of-refusal
View on GitHub
Code to the paper: The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
☆35Jul 31, 2025Updated 11 months ago
naver-ai / JOOD
View on GitHub
[CVPR 2025] Official implementation for JOOD "Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy"
☆21Jun 11, 2025Updated last year
ml-research / Q16
View on GitHub
☆35May 22, 2024Updated 2 years ago
QingyuLiu / Agentic-Upward-Deception
View on GitHub
This repo is the official implementation of “Are Your Agents Upward Deceivers?”. The paper is accepted by ICML 2026.
☆24Dec 15, 2025Updated 7 months ago
zchoi / SPT
View on GitHub
[TCSVT23] Official code for "SPT: Spatial Pyramid Transformer for Image Captioning".
☆10Aug 14, 2024Updated last year