[EMNLP 2025] Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
☆12 Aug 22, 2025 Updated 6 months ago
Alternatives and similar repositories for Reasoning-to-Defend
Users who are interested in Reasoning-to-Defend are comparing it to the repositories listed below.
- ☆24 Feb 17, 2026 Updated 2 weeks ago
- Code repo of our paper Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis (https://arxiv.org/abs/2406.10794… ☆23 Jul 26, 2024 Updated last year
- [AAAI'26 Oral] Official Implementation of STAR-1: Safer Alignment of Reasoning LLMs with 1K Data ☆33 Apr 7, 2025 Updated 10 months ago
- ☆39 May 17, 2025 Updated 9 months ago
- [COLM 2024] JailBreakV-28K: A comprehensive benchmark designed to evaluate the transferability of LLM jailbreak attacks to MLLMs, and fur… ☆88 May 9, 2025 Updated 9 months ago
- AmpleGCG: Learning a Universal and Transferable Generator of Adversarial Attacks on Both Open and Closed LLM ☆83 Nov 3, 2024 Updated last year
- ☆33 Jun 24, 2024 Updated last year
- Awesome Large Reasoning Model (LRM) Safety. This repository collects security-related research on large reasoning models such as … ☆82 Updated this week
- A Framework for Evaluating AI Agent Safety in Realistic Environments ☆30 Oct 2, 2025 Updated 5 months ago
- ☆14 Aug 7, 2025 Updated 6 months ago
- (ACL 2025 Main) Distilling RAG for SLMs from LLMs to Transfer Knowledge and Mitigate Hallucination via Evidence and Graph-based Distillat… ☆33 Aug 23, 2025 Updated 6 months ago
- Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM ☆39 Jan 17, 2025 Updated last year
- This repo is for the safety topic, including attacks, defenses and studies related to reasoning and RL ☆61 Sep 5, 2025 Updated 6 months ago
- [AAAI 2025] Neural-Symbolic Collaborative Distillation: Advancing Small Language Models for Complex Reasoning Tasks ☆11 Jun 19, 2025 Updated 8 months ago
- Fleming-R1: Toward Expert-Level Medical Reasoning via Reinforcement Learning ☆30 Sep 29, 2025 Updated 5 months ago
- [IEEE TIP] Official implementation for the work "BadCM: Invisible Backdoor Attack against Cross-Modal Learning". ☆14 Aug 30, 2024 Updated last year
- Code repo for the paper: Attacking Vision-Language Computer Agents via Pop-ups ☆51 Dec 23, 2024 Updated last year
- A program for simple mouse control via hand gestures using MediaPipe. ☆12 Mar 17, 2021 Updated 4 years ago
- ☆46 Jul 14, 2024 Updated last year
- ☆16 Mar 17, 2025 Updated 11 months ago
- Code for Fast Propagation is Better: Accelerating Single-Step Adversarial Training via Sampling Subnetworks (TIFS2024) ☆13 Mar 29, 2024 Updated last year
- ☆20 Feb 3, 2025 Updated last year
- An Android WebView with full screen video ☆10 Aug 17, 2017 Updated 8 years ago
- Prompt Generator model for Stable Diffusion Models ☆11 Jun 20, 2023 Updated 2 years ago
- todo: desc ☆11 Aug 12, 2021 Updated 4 years ago
- Adversarial Attack for Pre-trained Code Models ☆10 Jul 19, 2022 Updated 3 years ago
- The repo for paper: Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models. ☆13 Dec 16, 2024 Updated last year
- Source codes for the paper "Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning" (PDMER) which p… ☆14 Mar 24, 2025 Updated 11 months ago
- YOLOv11-pruning based on constraint of BN layer gamma values. ☆22 Jan 17, 2025 Updated last year
- Official repository for the paper, "FedMABench: Benchmarking Mobile GUI Agents on Decentralized Heterogeneous User Data", EMNLP 2025 Main… ☆15 Nov 11, 2025 Updated 3 months ago
- This is the official GitHub repository for our survey paper "Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language … ☆174 May 14, 2025 Updated 9 months ago
- [ACL 2025] Data and Code for Paper VLSBench: Unveiling Visual Leakage in Multimodal Safety ☆54 Jul 21, 2025 Updated 7 months ago
- Chinese Mammography Database (CMMD dataset) Deep Learning Classification Pipeline ☆15 Mar 15, 2022 Updated 3 years ago
- ☆11 Apr 3, 2024 Updated last year
- ☆58 Aug 11, 2024 Updated last year
- Database practice course project: a simple course selection system implemented with C# and SQL-Server ☆10 Oct 11, 2020 Updated 5 years ago
- The release version of OMCmf code for paper "One-pass Multi-view Clustering with Matrix Factorization" ☆13 Nov 30, 2021 Updated 4 years ago
- ☆12 Aug 2, 2021 Updated 4 years ago
- The first toolkit for MLRM safety evaluation, providing unified interface for mainstream models, datasets, and jailbreaking methods! ☆14 Apr 8, 2025 Updated 10 months ago