skywalker023 / confaideLinks

🤫 Code and benchmark for our ICLR 2024 spotlight paper: "Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory"

☆43

Alternatives and similar repositories for confaide

Users that are interested in confaide are comparing it to the libraries listed below

Sorting:

Princeton-SysML / kNNLM_privacy
Official implementation of Privacy Implications of Retrieval-Based Language Models (EMNLP 2023). https://arxiv.org/abs/2305.14888
☆36Updated last year
mireshghallah / ft-memorization
☆13Updated 2 years ago
joeljang / knowledge-unlearning
[ACL 2023] Knowledge Unlearning for Mitigating Privacy Risks in Language Models
☆82Updated 10 months ago
declare-lab / resta
Restore safety in fine-tuned language models through task arithmetic
☆28Updated last year
Vaidehi99 / InfoDeletionAttacks
☆44Updated 6 months ago
mireshghallah / neighborhood-curvature-mia
☆21Updated last year
yaojin17 / Unlearning_LLM
[ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models"
☆59Updated 10 months ago
SALT-NLP / Efficient_Unlearning
☆38Updated last year
pratyushmaini / llm_dataset_inference
Official Repository for Dataset Inference for LLMs
☆36Updated last year
snu-mllab / Bayesian-Red-Teaming
About Official PyTorch implementation of "Query-Efficient Black-Box Red Teaming via Bayesian Optimization" (ACL'23)
☆15Updated 2 years ago
weichen-yu / LM-Extraction
☆44Updated 2 years ago
UCSB-NLP-Chang / causal_unlearn
[EMNLP 2024] "Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective"
☆25Updated last year
snw2021 / LLM_Unlearning_Papers
☆26Updated last year
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆74Updated 5 months ago
amazon-science / controlling-llm-memorization
☆36Updated 2 years ago
princeton-nlp / benign-data-breaks-safety
☆41Updated 10 months ago
deeplearning-wisc / picle
Official code for ICML 2024 paper on Persona In-Context Learning (PICLe)
☆25Updated last year
boyiwei / CoTaEval
[NeurIPS 2024 D&B] Evaluating Copyright Takedown Methods for Language Models
☆17Updated last year
zjysteven / mink-plus-plus
[ICLR'25 Spotlight] Min-K%++: Improved baseline for detecting pre-training data of LLMs
☆41Updated 2 months ago
ykwon0407 / DataInf
DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024)
☆73Updated 10 months ago
yihuaihong / ConceptVectors
ConceptVectors Benchmark and Code for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces"
☆36Updated 5 months ago
vinid / safety-tuned-llamas
ICLR2024 Paper. Showing properties of safety tuning and exaggerated safety.
☆85Updated last year
dannyallover / overthinking_the_truth
☆29Updated last year
jaechan-repo / muse_bench
☆26Updated last year
jinzhuoran / RWKU
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024
☆77Updated 10 months ago
Thartvigsen / GRACE
[NeurIPS'23] Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors
☆78Updated 7 months ago
JasonForJoy / Model-Editing-Hurt
EMNLP 2024: Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue
☆35Updated 2 months ago
javiferran / sae_entities
☆60Updated 5 months ago
xiangyue9607 / Sentence-LDP
Code for the WWW'23 paper "Sanitizing Sentence Embeddings (and Labels) for Local Differential Privacy"
☆12Updated 2 years ago
hkust-nlp / Activation_Decoding
In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024)
☆61Updated last year