hendrycks / ethics
Aligning AI With Shared Human Values (ICLR 2021)
★254 · Updated last year
Related projects
Alternatives and complementary repositories for ethics
- Repository for research in the field of Responsible NLP at Meta. · ★186 · Updated last week
- Interpretability for sequence generation models · ★377 · Updated last week
- ★188 · Updated last month
- Datasets from the paper "Towards Understanding Sycophancy in Language Models" · ★62 · Updated last year
- LLM experiments done during SERI MATS, focusing on activation steering / interpreting activation spaces · ★78 · Updated last year
- ★111 · Updated last year
- StereoSet: Measuring stereotypical bias in pretrained language models · ★168 · Updated last year
- A library for finding knowledge neurons in pretrained transformer models. · ★151 · Updated 2 years ago
- PAIR.withgoogle.com and friends' work on interpretability methods · ★150 · Updated 3 weeks ago
- Training data extraction on GPT-2 · ★176 · Updated last year
- ★94 · Updated 6 months ago
- Röttger et al. (2023): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" · ★63 · Updated 10 months ago
- ★170 · Updated 9 months ago
- ★239 · Updated 4 months ago
- Sparse probing paper full code. · ★51 · Updated 11 months ago
- ★122 · Updated 3 weeks ago
- ★275 · Updated 3 months ago
- Steering Llama 2 with Contrastive Activation Addition · ★98 · Updated 5 months ago
- Dataset associated with "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation" paper · ★67 · Updated 3 years ago
- Repository for the Bias Benchmark for QA dataset. · ★87 · Updated 10 months ago
- ★98 · Updated 3 months ago
- Mechanistic Interpretability Visualizations using React · ★200 · Updated 4 months ago
- The official code of LM-Debugger, an interactive tool for inspection and intervention in transformer-based language models. · ★172 · Updated 2 years ago
- This repository contains the code for "Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP". · ★86 · Updated 3 years ago
- ACL 2022: An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models. · ★126 · Updated last year
- Run safety benchmarks against AI models and view detailed reports showing how well they performed. · ★62 · Updated this week
- ★221 · Updated last year
- Data for evaluating gender bias in coreference resolution systems. · ★68 · Updated 5 years ago
- A resource repository for representation engineering in large language models · ★54 · Updated last week
- [ICML 2021] Towards Understanding and Mitigating Social Biases in Language Models · ★60 · Updated 2 years ago