[CIKM 2024] Trojan Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment.
☆30 · Jul 29, 2024 · Updated last year
Alternatives and similar repositories for Trojan-Activation-Attack
Users that are interested in Trojan-Activation-Attack are comparing it to the libraries listed below.
- Paper list on decoding methods for LLMs and LVLMs ☆72 · Nov 7, 2025 · Updated 6 months ago
- Code for our NeurIPS 2024 paper Improved Generation of Adversarial Examples Against Safety-aligned LLMs ☆12 · Nov 7, 2024 · Updated last year
- ☆31 · Feb 27, 2025 · Updated last year
- Code for the paper "Self-Detoxifying Language Models via Toxification Reversal" (EMNLP 2023) ☆18 · Oct 17, 2023 · Updated 2 years ago
- [ICLR'26, NAACL'25 Demo] Toolkit & Benchmark for evaluating the trustworthiness of generative foundation models. ☆130 · Aug 22, 2025 · Updated 9 months ago
- Unofficial implementation of "Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection" ☆28 · Jul 6, 2024 · Updated last year
- Implementation of paper 'Reversing the Forget-Retain Objectives: An Efficient LLM Unlearning Framework from Logit Difference' [NeurIPS'24… ☆26 · Jun 14, 2024 · Updated last year
- Official Code for "Baseline Defenses for Adversarial Attacks Against Aligned Language Models" ☆34 · Oct 26, 2023 · Updated 2 years ago
- Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization ☆45 · Jul 28, 2024 · Updated last year
- Precision Knowledge Editing (PKE): A novel method to reduce toxicity in LLMs while preserving performance, with robust evaluations and ha… ☆11 · Nov 26, 2024 · Updated last year
- Official code for ICML 2024 paper on Persona In-Context Learning (PICLe) ☆28 · Jun 27, 2024 · Updated last year
- ☆73 · Feb 16, 2025 · Updated last year