NLie2 / what_features_jailbreak_LLMs
☆18 · Mar 30, 2025 · Updated 10 months ago
Alternatives and similar repositories for what_features_jailbreak_LLMs
Users interested in what_features_jailbreak_LLMs are comparing it to the libraries listed below.
- ☆23 · Jun 13, 2024 · Updated last year
- [ICLR 2025] BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks ☆30 · Nov 2, 2025 · Updated 3 months ago
- Q&A dataset for many-shot jailbreaking ☆14 · Jul 19, 2024 · Updated last year
- [COLING 2025] Official repo of paper: "Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jail… ☆12 · Jul 26, 2024 · Updated last year
- A repo for LLM jailbreak ☆14 · Sep 5, 2023 · Updated 2 years ago
- Tools for optimizing steering vectors in LLMs. ☆19 · Apr 10, 2025 · Updated 10 months ago
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ☆340 · Jun 13, 2025 · Updated 8 months ago
- Research on "Many-Shot Jailbreaking" in Large Language Models (LLMs). It unveils a novel technique capable of bypassing the safety mechan… ☆16 · Aug 6, 2024 · Updated last year
- Repository of Streaming LLMs ☆30 · Feb 5, 2026 · Updated last week
- Kim, J., Evans, J., & Schein, A. (2025). Linear Representations of Political Perspective Emerge in Large Language Models. ICLR. ☆24 · Mar 27, 2025 · Updated 10 months ago
- Synthesizing realistic and diverse text datasets from augmented LLMs ☆16 · Jan 26, 2026 · Updated 2 weeks ago
- [ICML 2025] Official source code for the paper "FlipAttack: Jailbreak LLMs via Flipping". ☆163 · May 2, 2025 · Updated 9 months ago
- Code for the paper "EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning" ☆37 · Oct 1, 2025 · Updated 4 months ago
- The most comprehensive and accurate LLM jailbreak attack benchmark by far ☆22 · Mar 22, 2025 · Updated 10 months ago
- Exploring Model Kinship for Merging Large Language Models ☆27 · Apr 16, 2025 · Updated 9 months ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆85 · Mar 7, 2025 · Updated 11 months ago
- The official repository for the guided jailbreak benchmark ☆28 · Jul 28, 2025 · Updated 6 months ago
- Code repo of our paper Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis (https://arxiv.org/abs/2406.10794… ☆23 · Jul 26, 2024 · Updated last year
- [ICML 2025] Weak-to-Strong Jailbreaking on Large Language Models ☆92 · May 2, 2025 · Updated 9 months ago
- ☆22 · Dec 17, 2024 · Updated last year
- Fluent student-teacher redteaming ☆23 · Jul 25, 2024 · Updated last year
- ☆27 · Oct 6, 2024 · Updated last year
- Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs. Empirical tricks for LLM jailbreaking. (NeurIPS 2024) ☆162 · Nov 30, 2024 · Updated last year
- ☆71 · Mar 30, 2025 · Updated 10 months ago
- Official implementation of the paper DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers ☆65 · Aug 25, 2024 · Updated last year
- ☆27 · Oct 22, 2024 · Updated last year
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆163 · Jun 25, 2025 · Updated 7 months ago
- The official repository for the ICLR 2025 accepted paper BadRobot: Manipulating Embodied LLMs in the Physical World. ☆41 · Jun 26, 2025 · Updated 7 months ago
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆65 · Jan 11, 2025 · Updated last year
- TAP: An automated jailbreaking method for black-box LLMs ☆219 · Dec 10, 2024 · Updated last year
- Irolyn is a jailbreak repo extractor for iOS 18 to iOS 18.5 and iPadOS 18 to iPadOS 18.5. ☆12 · May 15, 2025 · Updated 8 months ago
- Auditing agents for fine-tuning safety ☆18 · Oct 21, 2025 · Updated 3 months ago
- The NDIF server, which performs deep inference and serves nnsight requests remotely ☆41 · Updated this week
- ECSO (Make MLLMs safe with neither training nor any external models!) (https://arxiv.org/abs/2403.09572) ☆36 · Nov 2, 2024 · Updated last year
- [ICLR 2025] Official implementation for "SafeWatch: An Efficient Safety-Policy Following Video Guardrail Model with Transparent Explanati… ☆42 · Feb 11, 2025 · Updated last year
- Code and data to go with the Zhu et al. paper "An Objective for Nuanced LLM Jailbreaks" ☆36 · Dec 18, 2024 · Updated last year
- ☆33 · Jun 24, 2024 · Updated last year
- Open Source Replication of Anthropic's Alignment Faking Paper ☆54 · Apr 4, 2025 · Updated 10 months ago
- ☆58 · Oct 4, 2025 · Updated 4 months ago