yihedeng9/DuoGuard

DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails

☆34

Alternatives and similar repositories for DuoGuard

Users interested in DuoGuard are comparing it to the repositories listed below.

ZNLP / Language-Imbalance-Driven-Rewarding
[ICLR 2025] Language Imbalance Driven Rewarding for Multilingual Self-improving
☆25 · Apr 6, 2026 · Updated 3 months ago
uclaml / PDE
Official repo of Progressive Data Expansion: data, code and evaluation
☆29 · Nov 16, 2023 · Updated 2 years ago
beanie00 / self-distillation-analysis
Codebase for the work “Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?”
☆74 · Apr 14, 2026 · Updated 3 months ago
Hanpx20 / SafeSwitch
Official code repository for the paper "Internal Activation as the Polar Star for Steering Unsafe LLM Behavior"
☆15 · May 31, 2026 · Updated last month
mandyyyyii / east
☆19 · Aug 4, 2025 · Updated 11 months ago
VITA-Group / SEAL
[COLM 2025] SEAL: Steerable Reasoning Calibration of Large Language Models for Free
☆60 · Apr 6, 2025 · Updated last year
Vinsonzyh / BlueSuffix
[ICLR 2025] BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks
☆31 · Nov 2, 2025 · Updated 8 months ago
YuYang0901 / EPIC
Not All Poisons are Created Equal: Robust Training against Data Poisoning (ICML 2022)
☆22 · Aug 8, 2022 · Updated 3 years ago
BaohaoLiao / SAGE
Self-Hinting Language Models Enhance Reinforcement Learning
☆26 · Mar 28, 2026 · Updated 3 months ago
SaFo-Lab / ReasoningBomb
[CCS 2026] The official implementation of our CCS 2026 paper "ReasoningBomb: A Stealthy Denial-of-Service Attack by Inducing Pathological…
☆15 · Jun 24, 2026 · Updated 3 weeks ago
paul-rottger / msts-multimodal-safety
Röttger et al. (2025): "MSTS: A Multimodal Safety Test Suite for Vision-Language Models"
☆20 · Mar 31, 2025 · Updated last year
YuYang0901 / CLIP-spurious-finetune
Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning (ICML 2023)
☆19 · Dec 15, 2023 · Updated 2 years ago
yinyueqin / relative-preference-optimization
Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts
☆26 · Feb 23, 2024 · Updated 2 years ago
yecchen / MIRAI
Code and Data for "MIRAI: Evaluating LLM Agents for Event Forecasting"
☆111 · Jul 2, 2024 · Updated 2 years ago
YiyiyiZhao / siren
Welcome to the official repository for Siren, a project aimed at understanding and mitigating harmful behaviors in large language models …
☆15 · Jun 14, 2026 · Updated last month
TemporaryLoRA / FreeLM
☆15 · Feb 10, 2026 · Updated 5 months ago
sail-sg / Stable-RL
Rethinking the Trust Region in LLM Reinforcement Learning
☆62 · Mar 2, 2026 · Updated 4 months ago
ADaM-BJTU / OpenRFT
OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning
☆157 · Dec 24, 2024 · Updated last year
sail-sg / imperceptible-jailbreaks
[ArXiv 2025] Imperceptible Jailbreaking against Large Language Models
☆25 · Oct 7, 2025 · Updated 9 months ago
hanningzhang / ER-PRM
☆20 · Dec 14, 2024 · Updated last year
WangCheng0116 / Awesome-LRMs-Safety
Official repository for "Safety in Large Reasoning Models: A Survey" - Exploring safety risks, attacks, and defenses for Large Reasoning …
☆90 · Aug 25, 2025 · Updated 10 months ago
Infini-AI-Lab / M2PO
☆32 · Oct 8, 2025 · Updated 9 months ago
liangyupu / DIMTDA
The official repository of "Document Image Machine Translation with Dynamic Multi-pre-trained Models Assembling"
☆14 · Nov 26, 2025 · Updated 7 months ago
cwang621 / blsp
BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing
☆59 · Mar 11, 2024 · Updated 2 years ago
DSN-2024 / DSN
DSN jailbreak Attack & Evaluation Ensemble
☆17 · Feb 7, 2026 · Updated 5 months ago
HenryCai11 / LLM-Self-Control
The official repo of paper "Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller"
☆18 · Aug 13, 2024 · Updated last year
scaleapi / mrt
https://scale.com/research/mrt
☆20 · Mar 16, 2026 · Updated 4 months ago
wicai24 / DOOR-Alignment
☆20 · Apr 7, 2025 · Updated last year
yihedeng9 / rlhf-summary-notes
A brief and partial summary of RLHF algorithms.
☆152 · Mar 4, 2025 · Updated last year
chai-research / lmgym
Code base for internal reward models and PPO training
☆24 · Oct 1, 2023 · Updated 2 years ago
YorkUCVIL / UniversalSAE
Code base for Universal Sparse Autoencoders (USAEs)
☆21 · Sep 7, 2025 · Updated 10 months ago
HanjiangHu / NBF-LLM
The official code for "Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks".
☆18 · Jun 24, 2026 · Updated 3 weeks ago
AntNLP / nope_head_scale
☆29 · May 4, 2024 · Updated 2 years ago
hartvigsen-group / composable-interventions
☆29 · Feb 27, 2025 · Updated last year
grasses / PoisonPrompt
Code for paper: PoisonPrompt: Backdoor Attack on Prompt-based Large Language Models, IEEE ICASSP 2024. Demo: 124.220.228.133:11107
☆21 · Aug 10, 2024 · Updated last year
xydaytoy / EVA
☆14 · Apr 16, 2024 · Updated 2 years ago
AI45Lab / VLSBench
[ACL 2025] Data and Code for Paper VLSBench: Unveiling Visual Leakage in Multimodal Safety
☆62 · Jul 21, 2025 · Updated last year
Dtc7w3PQ / Response-Attack
Official implementation of “Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models” (AAAI 2026).
☆37 · Mar 22, 2026 · Updated 4 months ago
vertaix / Alternators
This repository contains the implementation of Alternators, a novel family of generative models for time-dependent data.
☆35 · Jun 6, 2025 · Updated last year