Does Refusal Training in LLMs Generalize to the Past Tense? [ICLR 2025]
☆78 · Jan 23, 2025 · Updated last year
Alternatives and similar repositories for llm-past-tense
Users interested in llm-past-tense are comparing it to the repositories listed below.
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [ICLR 2025] ☆379 · Jan 23, 2025 · Updated last year
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆65 · Jan 11, 2025 · Updated last year
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆129 · Feb 24, 2025 · Updated last year
- ☆31 · Sep 23, 2024 · Updated last year
- Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning [ICML 2024] ☆21 · May 2, 2024 · Updated last year
- Privacy backdoors ☆50 · Apr 28, 2024 · Updated last year
- ☆122 · Feb 3, 2025 · Updated last year
- [ACL 25] SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities ☆28 · Apr 2, 2025 · Updated 11 months ago
- ☆34 · Nov 12, 2024 · Updated last year
- Official repository for the paper "Safety Alignment Should Be Made More Than Just a Few Tokens Deep" ☆174 · Apr 23, 2025 · Updated 10 months ago
- [NeurIPS 2024] Goldfish Loss: Mitigating Memorization in Generative LLMs ☆97 · Nov 17, 2024 · Updated last year
- Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025) ☆76 · Mar 1, 2025 · Updated last year
- SLTrain: a sparse plus low-rank approach for parameter- and memory-efficient pretraining (NeurIPS 2024) ☆39 · Nov 1, 2024 · Updated last year
- ACL24 ☆11 · Jun 7, 2024 · Updated last year
- Bayesian scaling laws for in-context learning ☆15 · Mar 12, 2025 · Updated 11 months ago
- ☆26 · Sep 3, 2025 · Updated 6 months ago
- Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023] ☆43 · Apr 28, 2024 · Updated last year
- [ICLR 2025] Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates (Oral) ☆84 · Oct 23, 2024 · Updated last year
- ☆30 · Jun 25, 2024 · Updated last year
- Code for Adaptive Data Optimization ☆32 · Dec 9, 2024 · Updated last year
- [ICML 2025] Official repository for the paper "OR-Bench: An Over-Refusal Benchmark for Large Language Models" ☆23 · Mar 4, 2025 · Updated last year
- ☆12 · Jul 24, 2025 · Updated 7 months ago
- Code for my participation in the RuNNE competition (https://github.com/dialogue-evaluation/RuNNE) ☆13 · Jun 28, 2023 · Updated 2 years ago
- Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025] ☆32 · Jan 23, 2025 · Updated last year
- [ACL 2024] CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion ☆58 · Oct 1, 2025 · Updated 5 months ago
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track] ☆540 · Apr 4, 2025 · Updated 11 months ago
- [ICML 2023] "Robust Weight Signatures: Gaining Robustness as Easy as Patching Weights?" by Ruisi Cai, Zhenyu Zhang, Zhangyang Wang ☆16 · May 4, 2023 · Updated 2 years ago
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024 ☆116 · Jun 13, 2024 · Updated last year
- ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors [EMNLP 2024 Findings] ☆226 · Sep 29, 2024 · Updated last year
- ☆36 · May 21, 2025 · Updated 9 months ago
- [ACL 2024] Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization ☆29 · Jul 9, 2024 · Updated last year
- [ICLR 2025] Official repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆67 · Jun 9, 2025 · Updated 8 months ago
- A powerful white-box adversarial attack that exploits knowledge about the geometry of neural networks to find minimal adversarial perturbations ☆12 · Aug 5, 2020 · Updated 5 years ago
- Official repository for the paper "Number Cookbook: Number Understanding of Language Models and How to Improve It" ☆19 · Mar 31, 2025 · Updated 11 months ago
- Code to break Llama Guard ☆32 · Dec 7, 2023 · Updated 2 years ago
- Code and data for the Zhu et al. paper "An Objective for Nuanced LLM Jailbreaks" ☆36 · Dec 18, 2024 · Updated last year
- An easy-to-use Python framework to generate adversarial jailbreak prompts ☆820 · Mar 27, 2025 · Updated 11 months ago
- Repository for the paper "StrongREJECT for Empty Jailbreaks" ☆152 · Nov 3, 2024 · Updated last year
- [ICLR 2025] Dissecting adversarial robustness of multimodal language model agents ☆130 · Feb 19, 2025 · Updated last year