SALT-NLP / PrivacyLens
A data construction and evaluation framework to quantify the privacy norm awareness of language models (LMs) and the emerging privacy risks of LM agents. (NeurIPS 2024 D&B)
☆25 · Updated last month
Alternatives and similar repositories for PrivacyLens:
Users interested in PrivacyLens are comparing it to the repositories listed below.
- [NeurIPS 2023 D&B Track] Code and data for the paper "Revisiting Out-of-distribution Robustness in NLP: Benchmarks, Analysis, and LLMs Evalua…" ☆33 · Updated last year
- [ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use ☆135 · Updated last year
- ☆38 · Updated last year
- ☆31 · Updated last year
- ☆20 · Updated 6 months ago
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆96 · Updated last month
- ICLR 2024 paper showing properties of safety tuning and exaggerated safety ☆78 · Updated 10 months ago
- Weak-to-Strong Jailbreaking on Large Language Models ☆72 · Updated last year
- Code and datasets for the paper "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment" ☆96 · Updated last year
- NLPBench: Evaluating NLP-Related Problem-solving Ability in Large Language Models ☆10 · Updated last year
- A resource repository for representation engineering in large language models ☆117 · Updated 4 months ago
- Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization" ☆41 · Updated 2 months ago
- ☆25 · Updated 10 months ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity ☆71 · Updated 3 weeks ago
- ☆47 · Updated last year
- Repository for the Bias Benchmark for QA dataset ☆107 · Updated last year
- [NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering ☆52 · Updated 4 months ago
- A lightweight library for large language model (LLM) jailbreaking defense ☆49 · Updated 5 months ago
- ☆29 · Updated 11 months ago
- Code for the EMNLP 2024 paper "Neuron-Level Knowledge Attribution in Large Language Models" ☆29 · Updated 4 months ago
- ☆42 · Updated last month
- [ICLR 2025] Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization ☆24 · Updated 2 months ago
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆109 · Updated 11 months ago
- ☆38 · Updated last year
- ☆25 · Updated 6 months ago
- Data and code for the preprint "In-Context Learning with Long-Context Models: An In-Depth Exploration" ☆34 · Updated 7 months ago
- AmpleGCG: Learning a Universal and Transferable Generator of Adversarial Attacks on Both Open and Closed LLMs ☆59 · Updated 5 months ago
- Augmenting Statistical Models with Natural Language Parameters ☆24 · Updated 6 months ago
- ☆155 · Updated 4 months ago
- GitHub repository for "FELM: Benchmarking Factuality Evaluation of Large Language Models" (NeurIPS 2023) ☆58 · Updated last year