google-research-datasets / dices-dataset
This repository contains two datasets with multi-turn adversarial conversations generated by human agents interacting with a dialog model and rated for safety by two corresponding diverse rater pools.
☆29 Updated last year
Alternatives and similar repositories for dices-dataset
Users who are interested in dices-dataset are comparing it to the repositories listed below.
- Repository for the Bias Benchmark for QA dataset. ☆132 Updated last year
- ☆116 Updated last year
- ☆28 Updated last year
- Steering Llama 2 with Contrastive Activation Addition ☆194 Updated last year
- [ICLR 2025] General-purpose activation steering library ☆120 Updated 2 months ago
- Sparse probing paper full code. ☆65 Updated last year
- The Prism Alignment Project ☆86 Updated last year
- ☆49 Updated last year
- ☆61 Updated 4 months ago
- [NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering ☆67 Updated last year
- ☆57 Updated 2 years ago
- Synthetic question-answering dataset to formally analyze the chain-of-thought output of large language models on a reasoning task. ☆154 Updated 2 months ago
- ☆52 Updated 7 months ago
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆116 Updated 9 months ago
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆99 Updated 2 years ago
- Lightweight Adapting for Black-Box Large Language Models ☆24 Updated last year
- [ACL 2023] Knowledge Unlearning for Mitigating Privacy Risks in Language Models ☆84 Updated last year
- Inspecting and Editing Knowledge Representations in Language Models ☆119 Updated 2 years ago
- ☆156 Updated 2 years ago
- code for EMNLP 2024 paper: Neuron-Level Knowledge Attribution in Large Language Models ☆47 Updated last year
- ☆165 Updated last year
- ACL 2022: An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models. ☆151 Updated 3 months ago
- Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers". ☆80 Updated last year
- WMDP is a LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆155 Updated 5 months ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆145 Updated 5 months ago
- Augmenting Statistical Models with Natural Language Parameters ☆29 Updated last year
- ☆28 Updated last year
- A resource repository for representation engineering in large language models ☆141 Updated last year
- ☆29 Updated last year
- This repo contains code for our NeurIPS 2023 spotlight paper: Evaluating and Inducing Personality in Pre-trained Language Models ☆55 Updated last year