Code for safety test in "Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates"
☆22 · Updated Sep 21, 2025
Alternatives and similar repositories for PTST
Users that are interested in PTST are comparing it to the libraries listed below
- Our research proposes a novel MoGU framework that improves LLMs' safety while preserving their usability. · ☆18 · Updated Jan 14, 2025
- ☆44 · Updated Oct 1, 2024
- This is the official code for the paper "Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning" (NeurIPS 2024) · ☆25 · Updated Sep 10, 2024
- ☆19 · Updated May 14, 2025
- Fine-tuning-free Shapley value (FreeShap) for instance attribution · ☆14 · Updated May 29, 2024
- ☆13 · Updated Aug 9, 2023
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" · ☆66 · Updated Jun 9, 2025
- ☆60 · Updated Mar 9, 2023
- Code for the paper "Safety Layers in Aligned Large Language Models: The Key to LLM Security" (ICLR 2025) · ☆21 · Updated Apr 26, 2025
- We jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed examples, at a cost of less than $0.20… · ☆339 · Updated Feb 23, 2024
- Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning [ICML 2024] · ☆21 · Updated May 2, 2024
- ☆23 · Updated Jan 17, 2025
- Codebase for "On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback". This repo implements a generative multi-tur… · ☆23 · Updated Dec 3, 2024
- "In-Context Unlearning: Language Models as Few Shot Unlearners". Martin Pawelczyk, Seth Neel* and Himabindu Lakkaraju*; ICML 2024. · ☆29 · Updated Oct 18, 2023
- Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback" · ☆66 · Updated Apr 24, 2024
- Improving Alignment and Robustness with Circuit Breakers · ☆258 · Updated Sep 24, 2024
- The official implementation of the paper "Does Federated Learning Really Need Backpropagation?" · ☆23 · Updated Feb 9, 2023
- This is the repository that introduces research topics related to protecting intellectual property (IP) of AI from a data-centric perspec… · ☆23 · Updated Oct 30, 2023
- Official repository for "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks" · ☆61 · Updated Aug 8, 2024
- Multi-Layer Sparse Autoencoders (ICLR 2025) · ☆29 · Updated Feb 6, 2026
- ☆32 · Updated Feb 11, 2025
- Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025] · ☆32 · Updated Jan 23, 2025
- In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024) · ☆62 · Updated Mar 30, 2024
- [NeurIPS 2024] Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling · ☆33 · Updated Nov 8, 2024
- One stop-shop for matplotlib based visualizations · ☆10 · Updated Jun 9, 2025
- ☆14 · Updated Oct 23, 2025
- AmpleGCG: Learning a Universal and Transferable Generator of Adversarial Attacks on Both Open and Closed LLMs · ☆83 · Updated Nov 3, 2024
- Code for paper "Defending against LLM Jailbreaking via Backtranslation" · ☆34 · Updated Aug 16, 2024
- ☆35 · Updated Sep 13, 2023
- [ACL 2024] The official codebase for the paper "Self-Distillation Bridges Distribution Gap in Language Model Fine-tuning". · ☆157 · Updated Nov 2, 2024
- DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024) · ☆79 · Updated Oct 3, 2024
- ☆14 · Updated Feb 5, 2025
- Agriculture Land and Commission System · ☆10 · Updated this week
- Model to determine the expected power output of a PV system based on DWD weather forecast data · ☆13 · Updated Jan 17, 2024
- ☆51 · Updated Oct 23, 2023
- EMNLP 2024: Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue · ☆38 · Updated May 26, 2025
- [ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications · ☆89 · Updated Mar 30, 2025
- Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep · ☆174 · Updated Apr 23, 2025
- [ICLR 2025] A Closer Look at Machine Unlearning for Large Language Models · ☆45 · Updated Dec 4, 2024