tml-epfl / llm-past-tense
Does Refusal Training in LLMs Generalize to the Past Tense? [ICLR 2025]
☆77 · Jan 23, 2025 · Updated last year
Alternatives and similar repositories for llm-past-tense
Users interested in llm-past-tense are comparing it to the repositories listed below.
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [ICLR 2025] ☆377 · Jan 23, 2025 · Updated last year
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024) ☆65 · Jan 11, 2025 · Updated last year
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆127 · Feb 24, 2025 · Updated 11 months ago
- ☆31 · Sep 23, 2024 · Updated last year
- Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning [ICML 2024] ☆21 · May 2, 2024 · Updated last year
- Privacy backdoors ☆50 · Apr 28, 2024 · Updated last year
- Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep ☆173 · Apr 23, 2025 · Updated 9 months ago
- [NeurIPS 2024] Goldfish Loss: Mitigating Memorization in Generative LLMs ☆94 · Nov 17, 2024 · Updated last year
- Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025) ☆75 · Mar 1, 2025 · Updated 11 months ago
- Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation [NeurIPS 2017] ☆18 · Apr 8, 2018 · Updated 7 years ago
- Bayesian scaling laws for in-context learning ☆15 · Mar 12, 2025 · Updated 11 months ago
- ACL24 ☆11 · Jun 7, 2024 · Updated last year
- [ICLR 2025] Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates (Oral) ☆84 · Oct 23, 2024 · Updated last year
- Code for Adaptive Data Optimization ☆32 · Dec 9, 2024 · Updated last year
- ☆30 · Jun 25, 2024 · Updated last year
- Code for the API, workload execution, and agents underlying the LLMail-Inject Adaptive Prompt Injection Challenge ☆19 · Oct 21, 2025 · Updated 3 months ago
- ☆12 · Jul 24, 2025 · Updated 6 months ago
- An Empirical Study of Memorization in NLP (ACL 2022) ☆13 · Jun 22, 2022 · Updated 3 years ago
- The author's participation in the RuNNE competition (https://github.com/dialogue-evaluation/RuNNE) ☆13 · Jun 28, 2023 · Updated 2 years ago
- [ICML 2025] Official repository for paper "OR-Bench: An Over-Refusal Benchmark for Large Language Models" ☆21 · Mar 4, 2025 · Updated 11 months ago
- Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025] ☆32 · Jan 23, 2025 · Updated last year
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Language Models [NeurIPS 2024 Datasets and Benchmarks Track] ☆527 · Apr 4, 2025 · Updated 10 months ago
- [ICML 2023] "Robust Weight Signatures: Gaining Robustness as Easy as Patching Weights?" by Ruisi Cai, Zhenyu Zhang, Zhangyang Wang ☆16 · May 4, 2023 · Updated 2 years ago
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆116 · Jun 13, 2024 · Updated last year
- A fast + lightweight implementation of the GCG algorithm in PyTorch ☆317 · May 13, 2025 · Updated 9 months ago
- ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors [EMNLP 2024 Findings] ☆225 · Sep 29, 2024 · Updated last year
- ☆35 · May 21, 2025 · Updated 8 months ago
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆66 · Jun 9, 2025 · Updated 8 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆258 · Sep 24, 2024 · Updated last year
- Collection of experiments related to small language models, mostly seq2seq models ☆15 · Jan 9, 2026 · Updated last month
- Official repository for the paper "Number Cookbook: Number Understanding of Language Models and How to Improve It" ☆19 · Mar 31, 2025 · Updated 10 months ago
- A powerful white-box adversarial attack that exploits knowledge about the geometry of neural networks to find minimal adversarial perturb… ☆12 · Aug 5, 2020 · Updated 5 years ago
- Code and data to go with the Zhu et al. paper "An Objective for Nuanced LLM Jailbreaks" ☆36 · Dec 18, 2024 · Updated last year
- Code to break Llama Guard ☆32 · Dec 7, 2023 · Updated 2 years ago
- [ICLR 2025] Dissecting adversarial robustness of multimodal language model agents ☆123 · Feb 19, 2025 · Updated 11 months ago
- Official implementation of paper: DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers ☆65 · Aug 25, 2024 · Updated last year
- Repository for the "StrongREJECT for Empty Jailbreaks" paper ☆151 · Nov 3, 2024 · Updated last year
- An easy-to-use Python framework to generate adversarial jailbreak prompts ☆815 · Mar 27, 2025 · Updated 10 months ago
- Official repo for the paper "Accelerating Parallel Sampling of Diffusion Models," Tang et al., ICML 2024, https://openreview.net… ☆16 · Jul 19, 2024 · Updated last year