Our research proposes a novel MoGU framework that improves LLMs' safety while preserving their usability.
☆18Jan 14, 2025Updated last year
Alternatives and similar repositories for MoGU
Users that are interested in MoGU are comparing it to the libraries listed below
Sorting:
- ☆23Jan 17, 2025Updated last year
- Code for LLM_Catastrophic_Forgetting via SAM.☆11Jun 7, 2024Updated last year
- TACL 2025: Investigating Adversarial Trigger Transfer in Large Language Models☆19Aug 17, 2025Updated 6 months ago
- ☆21Jul 26, 2025Updated 7 months ago
- [ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models☆90May 2, 2025Updated 10 months ago
- Code repo of our paper Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis (https://arxiv.org/abs/2406.10794…☆23Jul 26, 2024Updated last year
- Thorn in a HaizeStack test for evaluating long-context adversarial robustness.☆26Aug 3, 2024Updated last year
- Official code for ICML 2024 paper on Persona In-Context Learning (PICLe)☆26Jun 27, 2024Updated last year
- Multi-Layer Sparse Autoencoders (ICLR 2025)☆29Feb 6, 2026Updated 3 weeks ago
- The repository of the paper "REEF: Representation Encoding Fingerprints for Large Language Models," aims to protect the IP of open-source…☆74Jan 16, 2025Updated last year
- [NeurIPS 2024] Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling☆33Nov 8, 2024Updated last year
- A survey on harmful fine-tuning attack for large language model☆232Updated this week
- Improved techniques for optimization-based jailbreaking on large language models (ICLR2025)☆142Apr 7, 2025Updated 10 months ago
- ☆35Sep 13, 2023Updated 2 years ago
- Awesome Large Reasoning Model(LRM) Safety.This repository is used to collect security-related research on large reasoning models such as …☆82Updated this week
- Code for "APTBench: Benchmarking Agentic Potential of Base LLMs During Pre-Training"☆38Dec 23, 2025Updated 2 months ago
- ☆14Feb 5, 2025Updated last year
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…☆341Feb 23, 2024Updated 2 years ago
- ☆33Mar 13, 2025Updated 11 months ago
- Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding☆151Jul 19, 2024Updated last year
- This repo is for the safety topic, including attacks, defenses and studies related to reasoning and RL☆61Sep 5, 2025Updated 5 months ago
- ☆20Aug 8, 2025Updated 6 months ago
- Identification of the Adversary from a Single Adversarial Example (ICML 2023)☆10Jul 15, 2024Updated last year
- Official repository for "DYPLOC: Dynamic Planning of Content Using Mixed Language Models for Opinion Text Generation"☆10May 20, 2022Updated 3 years ago
- ☆44Oct 1, 2024Updated last year
- ☆16Jul 29, 2025Updated 7 months ago
- [AAAI26] Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilitie…☆10Feb 7, 2026Updated 3 weeks ago
- Win + D for One Monitor (Show Desktop only for One Monitor)☆10Dec 15, 2022Updated 3 years ago
- Python implementation of Gnutella for CS 114 P2P systems☆12Mar 23, 2012Updated 13 years ago
- Homeworks, Midterm, & Capstone from ML BookCamp☆16Jan 28, 2022Updated 4 years ago
- Tool to decrypt encrypted strings in AgentTesla☆16Jan 24, 2022Updated 4 years ago
- Official Repository for Can Language Models be Instructed to Protect Personal Information?☆13Oct 8, 2023Updated 2 years ago
- [WSDM 2026] LookAhead Tuning: Safer Language Models via Partial Answer Previews☆17Dec 14, 2025Updated 2 months ago
- Code for paper "Concrete Subspace Learning based Interference Elimination for Multi-task Model Fusion"☆14Mar 28, 2024Updated last year
- [NeurIPS 2025@FoRLM] R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search☆17Jan 24, 2026Updated last month
- Personality Alignment of Language Models☆53Jul 1, 2025Updated 8 months ago
- This is the official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS2024)☆49Jan 15, 2026Updated last month
- A versatile and powerful data platform allowing interactive searches, dashboards, alerts, and more.☆26Sep 12, 2025Updated 5 months ago
- Unofficail pytorch implementation of BigBiGAN☆11Mar 26, 2021Updated 4 years ago