skywalker023 / confaide
🤫 Code and benchmark for our ICLR 2024 spotlight paper: "Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory"
☆44 · Updated last year
Alternatives and similar repositories for confaide
Users interested in confaide are comparing it to the repositories listed below.
- [ACL 2023] Knowledge Unlearning for Mitigating Privacy Risks in Language Models ☆82 · Updated last year
- Official implementation of Privacy Implications of Retrieval-Based Language Models (EMNLP 2023). https://arxiv.org/abs/2305.14888 ☆36 · Updated last year
- ☆45 · Updated 7 months ago
- ☆13 · Updated 2 years ago
- Restore safety in fine-tuned language models through task arithmetic ☆28 · Updated last year
- ☆39 · Updated 2 years ago
- Official repository for Dataset Inference for LLMs ☆41 · Updated last year
- [ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models" ☆60 · Updated 11 months ago
- ☆38 · Updated last year
- Official PyTorch implementation of "Query-Efficient Black-Box Red Teaming via Bayesian Optimization" (ACL'23) ☆15 · Updated 2 years ago
- [EMNLP 2024] "Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective" ☆28 · Updated last year
- ☆22 · Updated 2 years ago
- [ICLR'25 Spotlight] Min-K%++: Improved baseline for detecting pre-training data of LLMs ☆44 · Updated 3 months ago
- ☆43 · Updated 2 years ago
- ☆41 · Updated 11 months ago
- ☆26 · Updated last year
- Implementation of the paper "Exploring the Universal Vulnerability of Prompt-based Learning Paradigm" (Findings of NAACL 2022) ☆30 · Updated 3 years ago
- [EMNLP 2024] Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue ☆36 · Updated 3 months ago
- [EMNLP 2023] Poisoning Retrieval Corpora by Injecting Adversarial Passages https://arxiv.org/abs/2310.19156 ☆37 · Updated last year
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity ☆78 · Updated 6 months ago
- [NeurIPS 2024 D&B] Evaluating Copyright Takedown Methods for Language Models ☆17 · Updated last year
- [NeurIPS'23] Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors ☆80 · Updated 9 months ago
- Code for watermarking language models ☆82 · Updated last year
- [ICLR 2024] Paper showing properties of safety tuning and exaggerated safety ☆86 · Updated last year
- [EMNLP 2025 Main] ConceptVectors benchmark and code for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces" ☆35 · Updated last month
- ☆57 · Updated 2 years ago
- Official code for the ICML 2024 paper on Persona In-Context Learning (PICLe) ☆26 · Updated last year
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆111 · Updated 6 months ago
- The git repository of the Modular Prompted Chatbot paper ☆35 · Updated 2 years ago
- Source code for the NeurIPS'24 paper "HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection" ☆54 · Updated 5 months ago