DYR1/MoGU

Our research proposes a novel MoGU framework that improves LLMs' safety while preserving their usability.

☆18

Alternatives and similar repositories for MoGU

Users that are interested in MoGU are comparing it to the libraries listed below.

vfleaking / PTST
Code for safety test in "Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates"
☆22 · Sep 21, 2025 · Updated 10 months ago
HenryZhen97 / Reconsidering-Overthinking
Reconsidering Overthinking: Penalizing Internal and External Redundancy in CoT Reasoning
☆23 · Jun 23, 2026 · Updated 3 weeks ago
aladinD / SafeMERGE
Code for SafeMERGE (ICLR 2025).
☆15 · Apr 1, 2025 · Updated last year
LLLeoLi / LARF
[EMNLP 2025] Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
☆15 · Jul 22, 2025 · Updated 11 months ago
Li-Hyn / LLM_CatastrophicForgetting
Code for LLM_Catastrophic_Forgetting via SAM.
☆11 · Jun 7, 2024 · Updated 2 years ago
weizeming / momentum-attack-llm
☆25 · Jan 17, 2025 · Updated last year
XuandongZhao / weak-to-strong
[ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models
☆90 · May 2, 2025 · Updated last year
deeplearning-wisc / picle
Official code for ICML 2024 paper on Persona In-Context Learning (PICLe)
☆28 · Jun 27, 2024 · Updated 2 years ago
yuplin2333 / representation-space-jailbreak
Code repo of our paper Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis (https://arxiv.org/abs/2406.10794…
☆24 · Jul 26, 2024 · Updated last year
PKU-Alignment / eval-anything
☆22 · Jul 26, 2025 · Updated 11 months ago
git-disl / awesome_LLM-harmful-fine-tuning-papers
A survey on harmful fine-tuning attacks for large language models (ACM CSUR)
☆247 · Jun 22, 2026 · Updated 3 weeks ago
AI45Lab / REEF
The repository of the paper "REEF: Representation Encoding Fingerprints for Large Language Models," aims to protect the IP of open-source…
☆79 · Jan 16, 2025 · Updated last year
IBM / SafeLoRA
GitHub repo for NeurIPS 2024 paper "Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models"
☆29 · Dec 21, 2025 · Updated 7 months ago
wangrongding / folder-print
🌿 Quickly generate a folder directory structure; supports defining the directory depth and writing the output to a markdown file.
☆13 · Oct 19, 2022 · Updated 3 years ago
listen0425 / Safety-Layers
Code space of the paper "Safety Layers in Aligned Large Language Models: The Key to LLM Security" (ICLR 2025)
☆25 · Apr 26, 2025 · Updated last year
hanshen95 / SEAL
An implementation of SEAL: Safety-Enhanced Aligned LLM fine-tuning via bilevel data selection.
☆24 · Feb 20, 2025 · Updated last year
yellowtownhz / sycophancy-interpretability
☆15 · Feb 5, 2025 · Updated last year
qizhangli / Gradient-based-Jailbreak-Attacks
Code for our NeurIPS 2024 paper Improved Generation of Adversarial Examples Against Safety-aligned LLMs
☆12 · Nov 7, 2024 · Updated last year
homles11 / SaLoRA
Code for "SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation" (ICLR 2025)
☆29 · Oct 23, 2025 · Updated 8 months ago
DanielSc4 / Dynamic-Activation-Composition
Materials for "Multi-property Steering of Large Language Models with Dynamic Activation Composition"
☆14 · Nov 22, 2024 · Updated last year
haizelabs / thorn-in-haizestack
Thorn in a HaizeStack test for evaluating long-context adversarial robustness.
☆26 · Aug 3, 2024 · Updated last year
XinyuHua / dyploc-acl2021
Official repository for "DYPLOC: Dynamic Planning of Content Using Mixed Language Models for Opinion Text Generation"
☆10 · May 20, 2022 · Updated 4 years ago
open-compass / RePro
[ICLR 2026] Rectifying LLM Thought From Lens of Optimization
☆15 · Dec 5, 2025 · Updated 7 months ago
git-disl / Virus
This is the official code for the paper "Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation"
☆56 · Feb 2, 2025 · Updated last year
wonderNefelibata / Awesome-LRM-Safety
Awesome Large Reasoning Model (LRM) Safety. This repository is used to collect security-related research on large reasoning models such as …
☆84 · Updated this week
LLM-Tuning-Safety / LLMs-Finetuning-Safety
We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20…
☆358 · Feb 23, 2024 · Updated 2 years ago
princeton-polaris-lab / Evaluating-Durable-Safeguards
[ICLR 2025] On Evaluating the Durability of Safeguards for Open-Weight LLMs
☆13 · Jun 20, 2025 · Updated last year
McGill-NLP / AdversarialTriggers
TACL 2025: Investigating Adversarial Trigger Transfer in Large Language Models
☆19 · Aug 17, 2025 · Updated 11 months ago
rmin2000 / adv_tracing
Identification of the Adversary from a Single Adversarial Example (ICML 2023)
☆10 · Jul 15, 2024 · Updated 2 years ago
SCIR-SC-Qiaoban-Team / FreeEvalLM
[AAAI26] Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilitie…
☆11 · Feb 7, 2026 · Updated 5 months ago
tanganke / subspace_fusion
Code for paper "Concrete Subspace Learning based Interference Elimination for Multi-task Model Fusion"
☆14 · Mar 28, 2024 · Updated 2 years ago
shadowkiller33 / Language_attack
A repo for LLM jailbreak
☆14 · Sep 5, 2023 · Updated 2 years ago
oshears / adv-ml-2020-snn-project
Advanced Machine Learning Fall 2020 Project Repository
☆12 · Dec 12, 2020 · Updated 5 years ago
IBM / NeuralFuse
[NeurIPS'24] "NeuralFuse: Learning to Recover the Accuracy of Access-Limited Neural Network Inference in Low-Voltage Regimes" by Hao-Lun …
☆10 · Sep 18, 2025 · Updated 10 months ago
MaheepChaudhary / SAE-Ravel
Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…
☆13 · Jan 26, 2025 · Updated last year
xcfcode / DHGN
Codes for our CCL 2021 paper: Incorporating Commonsense Knowledge into Abstractive Dialogue Summarization via Heterogeneous Graph Network…
☆26 · Jul 28, 2021 · Updated 4 years ago
rain152 / PAT
[NeurIPS 2024] Fight Back Against Jailbreaking via Prompt Adversarial Tuning
☆11 · Oct 29, 2024 · Updated last year
thestephencasper / benchmarking_interpretability
☆35 · Sep 13, 2023 · Updated 2 years ago
Re-Align / AlignTDS
Analyzing LLM Alignment via Token distribution shift
☆17 · Jan 26, 2024 · Updated 2 years ago