SALT-NLP / PrivacyLens
A data construction and evaluation framework to quantify privacy norm awareness of language models (LMs) and emerging privacy risks of LM agents. (NeurIPS 2024 D&B)
☆32 · Updated 8 months ago
Alternatives and similar repositories for PrivacyLens
Users interested in PrivacyLens are comparing it to the libraries listed below.
- ☆122 · Updated last week
- Toolkit for evaluating the trustworthiness of generative foundation models. ☆122 · Updated 2 months ago
- A resource repository for representation engineering in large language models ☆140 · Updated 11 months ago
- Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding ☆146 · Updated last year
- The Paper List on Data Contamination for Large Language Models Evaluation. ☆102 · Updated 2 months ago
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆116 · Updated 8 months ago
- [NeurIPS 2023 D&B Track] Code and data for paper "Revisiting Out-of-distribution Robustness in NLP: Benchmarks, Analysis, and LLMs Evalua… ☆35 · Updated 2 years ago
- A lightweight library for large language model (LLM) jailbreaking defense. ☆58 · Updated last month
- [ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use ☆171 · Updated last year
- Code for EMNLP 2024 paper: Neuron-Level Knowledge Attribution in Large Language Models ☆47 · Updated 11 months ago
- ICLR 2024 paper. Showing properties of safety tuning and exaggerated safety. ☆88 · Updated last year
- LLM experiments done during SERI MATS, focusing on activation steering / interpreting activation spaces ☆98 · Updated 2 years ago
- A curated list of LLM interpretability material: tutorials, libraries, surveys, papers, blogs, etc. ☆283 · Updated 7 months ago
- ☆57 · Updated 2 years ago
- Codes and datasets of the paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment ☆106 · Updated last year
- A curated list of resources for activation engineering ☆107 · Updated last month
- Repo for paper: Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge ☆14 · Updated last year
- ☆164 · Updated 11 months ago
- A Comprehensive Assessment of Trustworthiness in GPT Models ☆305 · Updated last year
- Official implementation of ICLR'24 paper, "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…) ☆84 · Updated last year
- [ICLR 2025] General-purpose activation steering library ☆115 · Updated last month
- Repository for the Bias Benchmark for QA dataset. ☆129 · Updated last year
- LoFiT: Localized Fine-tuning on LLM Representations ☆42 · Updated 9 months ago
- ☆208 · Updated 11 months ago
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆152 · Updated 5 months ago
- Official implementation of AdvPrompter, https://arxiv.org/abs/2404.16873 ☆168 · Updated last year
- A Synthetic Dataset for Personal Attribute Inference (NeurIPS'24 D&B) ☆45 · Updated 3 months ago
- LLM Unlearning ☆177 · Updated 2 years ago
- Official repository for ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models" ☆99 · Updated 5 months ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆83 · Updated 8 months ago