google-research-datasets / dices-dataset
This repository contains two datasets with multi-turn adversarial conversations generated by human agents interacting with a dialog model and rated for safety by two corresponding diverse rater pools.
☆25 Updated 7 months ago
Alternatives and similar repositories for dices-dataset:
Users who are interested in dices-dataset are comparing it to the repositories listed below.
- This repository includes code for the paper "Does Localization Inform Editing? Surprising Differences in Where Knowledge Is Stored vs. Ca… ☆59 Updated last year
- ☆81 Updated last week
- The Prism Alignment Project ☆66 Updated 9 months ago
- How do transformer LMs encode relations? ☆46 Updated 11 months ago
- ☆36 Updated last year
- Tasks for describing differences between text distributions. ☆16 Updated 6 months ago
- ☆31 Updated last year
- The accompanying code for "Transformer Feed-Forward Layers Are Key-Value Memories". Mor Geva, Roei Schuster, Jonathan Berant, and Omer Le… ☆89 Updated 3 years ago
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks ☆40 Updated 2 months ago
- Source code and data for ADEPT: A DEbiasing PrompT Framework (AAAI-23). ☆14 Updated 2 months ago
- ☆104 Updated 9 months ago
- ☆44 Updated 5 months ago
- ☆47 Updated last year
- Inspecting and Editing Knowledge Representations in Language Models ☆112 Updated last year
- Code repository for the paper "Mission: Impossible Language Models." ☆47 Updated this week
- Code for our EMNLP '22 paper "Fixing Model Bugs with Natural Language Patches" ☆19 Updated 2 years ago
- 👩‍💻 Code for the ACL paper "Detecting Edit Failures in LLMs: An Improved Specificity Benchmark" ☆20 Updated last year
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆101 Updated 9 months ago
- This repository contains data, code and models for contextual noncompliance. ☆20 Updated 7 months ago
- Augmenting Statistical Models with Natural Language Parameters ☆23 Updated 5 months ago
- ☆44 Updated 6 months ago
- datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆71 Updated last year
- ☆35 Updated 2 years ago
- Benchmarking Generalization to New Tasks from Natural Language Instructions ☆26 Updated 3 years ago
- Implementation of PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024) ☆32 Updated 3 months ago
- ☆40 Updated last week
- ☆33 Updated 4 months ago
- Evaluating the Moral Beliefs Encoded in LLMs ☆23 Updated 2 months ago
- [NAACL'25] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering ☆47 Updated 2 months ago
- Evaluate the Quality of Critique ☆35 Updated 8 months ago