richhh520 / Learnable-Privacy-Neurons-Localization

ACL 2024 Learnable Privacy Neurons Localization in Language Models

☆10

Related projects

Alternatives and complementary repositories for Learnable-Privacy-Neurons-Localization

Improbable-AI / curiosity_redteam
Official implementation of ICLR'24 paper, "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…
☆62 Updated 8 months ago
flamewei123 / DEPN
☆20 Updated 7 months ago
Vaidehi99 / InfoDeletionAttacks
☆38 Updated last year
boyiwei / alignment-attribution-code
Official Code for Paper: Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
☆60 Updated last month
princeton-nlp / benign-data-breaks-safety
☆21 Updated last month
ChenWu98 / agent-attack
[Arxiv 2024] Adversarial attacks on multimodal agents
☆39 Updated 4 months ago
sail-sg / I-FSJ
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)
☆48 Updated 3 months ago
SafeAILab / RAIN
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
☆84 Updated 6 months ago
renqibing / CodeAttack
[ACL 2024] CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
☆29 Updated 3 weeks ago
ydyjya / LLM-IHS-Explanation
☆31 Updated 5 months ago
zjunlp / KnowledgeCircuits
[NeurIPS 2024] Knowledge Circuits in Pretrained Transformers
☆75 Updated last month
lancopku / agent-backdoor-attacks
Code & Data for the paper "Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents" [NeurIPS 2024]
☆44 Updated last month
centerforaisafety / wmdp
WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m…
☆82 Updated 6 months ago
OpenSafetyLab / SALAD-BENCH
【ACL 2024】 SALAD benchmark & MD-Judge
☆106 Updated last month
yaojin17 / Unlearning_LLM
[ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models"
☆47 Updated last month
chujiezheng / LLM-Safeguard
Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models"
☆73 Updated 2 months ago
zhaoyiran924 / Probe-Sampling
[NeurIPS 2024] Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling
☆19 Updated 2 weeks ago
isXinLiu / MM-SafetyBench
Accepted by ECCV 2024
☆74 Updated last month
srzer / MOD
Official code for "Decoding-Time Language Model Alignment with Multiple Objectives".
☆14 Updated 3 weeks ago
thu-coai / SafeUnlearning
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
☆21 Updated 4 months ago
boyiwei / CoTaEval
Official code for the paper: Evaluating Copyright Takedown Methods for Language Models
☆15 Updated 4 months ago
facebookresearch / advprompter
Official implementation of AdvPrompter (https://arxiv.org/abs/2404.16873)
☆122 Updated 6 months ago
jinzhuoran / RWKU
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024
☆62 Updated last month
ChnQ / TracingLLM
☆23 Updated 6 months ago
licong-lin / negative-preference-optimization
☆35 Updated 4 months ago
thestephencasper / explore_establish_exploit_llms
☆31 Updated last year
renqibing / ActorAttack
☆53 Updated 3 weeks ago
eth-sri / llmprivacy
☆46 Updated 5 months ago
mireshghallah / neighborhood-curvature-mia
☆20 Updated last year
MiaoXiong2320 / llm-uncertainty
Code repo for the ICLR 2024 paper "Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs"
☆74 Updated 8 months ago