xbmxb / EnvDistractionLinks

☆22

Alternatives and similar repositories for EnvDistraction

Users that are interested in EnvDistraction are comparing it to the libraries listed below

Sorting:

Lordog / R-Judge
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (EMNLP Findings 2024)
☆91Updated 5 months ago
OSU-NLP-Group / AgentSafety
☆119Updated 5 months ago
ChenWu98 / agent-attack
[ICLR 2025] Dissecting adversarial robustness of multimodal language model agents
☆110Updated 8 months ago
DYR1 / MoGU
Our research proposes a novel MoGU framework that improves LLMs' safety while preserving their usability.
☆18Updated 9 months ago
T1aNS1R / Evil-Geniuses
☆68Updated last year
YihanWang617 / LLM-Jailbreaking-Defense-Backtranslation
Code for paper "Defending aginast LLM Jailbreaking via Backtranslation"
☆30Updated last year
chujiezheng / LLM-Safeguard
Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models"
☆98Updated 5 months ago
SafeAILab / RAIN
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
☆99Updated last year
AI4Good24 / PsySafe
☆49Updated 8 months ago
PKU-Alignment / llms-resist-alignment
[ACL2025 Best Paper] Language Models Resist Alignment
☆33Updated 4 months ago
SALT-NLP / PopupAttack
Code repo for the paper: Attacking Vision-Language Computer Agents via Pop-ups
☆44Updated 10 months ago
Jometeorie / KnowledgeSpread
☆34Updated last year
OpenSafetyLab / SALAD-BENCH
【ACL 2024】 SALAD benchmark & MD-Judge
☆163Updated 7 months ago
princeton-nlp / benign-data-breaks-safety
☆41Updated last year
Improbable-AI / curiosity_redteam
Official implementation of ICLR'24 paper, "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…
☆83Updated last year
PKU-Alignment / aligner
[NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct
☆188Updated 9 months ago
agiresearch / TrustAgent
TrustAgent: Towards Safe and Trustworthy LLM-based Agents
☆53Updated 8 months ago
zhaoyiran924 / Probe-Sampling
[NeurIPS 2024] Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling
☆30Updated 11 months ago
zhxieml / remiss-jailbreak
☆32Updated last year
DAMO-NLP-SG / multilingual-safety-for-LLMs
[ICLR 2024]Data for "Multilingual Jailbreak Challenges in Large Language Models"
☆89Updated last year
Yifan-Song793 / ETO
Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference)
☆151Updated 11 months ago
sail-sg / I-FSJ
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)
☆65Updated 9 months ago
ydyjya / LLM-IHS-Explanation
☆53Updated last year
KbsdJames / omni-math-rule
The rule-based evaluation subset and code implementation of Omni-MATH
☆23Updated 10 months ago
PKU-Alignment / beavertails
BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).
☆163Updated 2 years ago
OSU-NLP-Group / AgentAttack
☆22Updated last year
thu-coai / Agent-SafetyBench
☆64Updated 2 months ago
ybwang119 / Awesome-reasoning-safety
This repo is for the safety topic, including attacks, defenses and studies related to reasoning and RL
☆46Updated last month
XuandongZhao / weak-to-strong
[ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models
☆86Updated 5 months ago
SORRY-Bench / sorry-bench
Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)
☆62Updated 7 months ago