SALT-NLP / PrivacyLens
A data construction and evaluation framework to quantify the privacy norm awareness of language models (LMs) and the emerging privacy risks of LM agents. (NeurIPS 2024 D&B)
☆41 · Updated 10 months ago
Alternatives and similar repositories for PrivacyLens
Users interested in PrivacyLens are comparing it to the repositories listed below.
- ICLR 2024 paper showing properties of safety tuning and exaggerated safety. ☆93 · Updated last year
- [ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use. ☆183 · Updated last year
- ☆173 · Updated 3 months ago
- The Paper List on Data Contamination for Large Language Models Evaluation. ☆109 · Updated 2 weeks ago
- Code and datasets for the paper "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment". ☆108 · Updated last year
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models". ☆126 · Updated 11 months ago
- A lightweight library for large language model (LLM) jailbreaking defense. ☆61 · Updated 4 months ago
- Official repository for the ACL 2024 paper "SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding". ☆153 · Updated last year
- Code for the paper "Defending against LLM Jailbreaking via Backtranslation". ☆34 · Updated last year
- LLM Unlearning. ☆181 · Updated 2 years ago
- Official implementation of the ICLR'24 paper "Curiosity-driven Red Teaming for Large Language Models" (https://openreview.net/pdf?id=4KqkizX…). ☆87 · Updated last year
- Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025). ☆74 · Updated 11 months ago
- [ICLR'26, NAACL'25 Demo] Toolkit & benchmark for evaluating the trustworthiness of generative foundation models. ☆125 · Updated 5 months ago
- BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs). ☆175 · Updated 2 years ago
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆157 · Updated 8 months ago
- [ACL 2024] SALAD benchmark & MD-Judge. ☆170 · Updated 10 months ago
- Official repository for the ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models". ☆105 · Updated 8 months ago
- Python package for measuring memorization in LLMs. ☆179 · Updated 6 months ago
- ☆58 · Updated 2 years ago
- Official implementation of AdvPrompter (https://arxiv.org/abs/2404.16873). ☆174 · Updated last year
- [ICLR 2025] General-purpose activation steering library. ☆138 · Updated 4 months ago
- ☆193 · Updated 2 years ago
- [ICLR 2025] Official repository for "Tamper-Resistant Safeguards for Open-Weight LLMs". ☆66 · Updated 7 months ago
- Repo for the paper "Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge". ☆14 · Updated last year
- A Synthetic Dataset for Personal Attribute Inference (NeurIPS'24 D&B). ☆50 · Updated 6 months ago
- A resource repository for representation engineering in large language models. ☆148 · Updated last year
- ☆28 · Updated last year
- Awesome LLM Self-Consistency: a curated list of self-consistency in large language models. ☆119 · Updated 6 months ago
- Repo for the research paper "SecAlign: Defending Against Prompt Injection with Preference Optimization". ☆83 · Updated 6 months ago
- Code repo for the ICLR 2024 paper "Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs". ☆143 · Updated last year