hendrycks/ethics
Aligning AI With Shared Human Values (ICLR 2021)
☆288 · Updated 2 years ago
Alternatives and similar repositories for ethics
Users interested in ethics are comparing it to the libraries listed below.
- Repository for research in the field of Responsible NLP at Meta ☆199 · Updated last month
- ☆291 · Updated last week
- Datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆81 · Updated last year
- StereoSet: Measuring stereotypical bias in pretrained language models ☆184 · Updated 2 years ago
- ☆136 · Updated last year
- Repository for the Bias Benchmark for QA dataset ☆118 · Updated last year
- Training data extraction on GPT-2 ☆187 · Updated 2 years ago
- ☆134 · Updated 7 months ago
- LLM experiments done during SERI MATS, focusing on activation steering and interpreting activation spaces ☆94 · Updated last year
- ☆227 · Updated 8 months ago
- The official code of LM-Debugger, an interactive tool for inspection and intervention in transformer-based language models ☆177 · Updated 3 years ago
- ☆280 · Updated 11 months ago
- PAIR.withgoogle.com and friends' work on interpretability methods ☆192 · Updated this week
- The Prism Alignment Project ☆78 · Updated last year
- This repository contains the data and code introduced in the paper "CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Maske…" ☆120 · Updated last year
- A library for finding knowledge neurons in pretrained transformer models ☆158 · Updated 3 years ago
- ☆106 · Updated last year
- ☆270 · Updated last year
- Dataset associated with the paper "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation" ☆80 · Updated 4 years ago
- ☆213 · Updated last year
- General-purpose activation steering library ☆81 · Updated last month
- ACL 2022: An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models ☆138 · Updated 6 months ago
- Synthetic question-answering dataset for formally analyzing the chain-of-thought output of large language models on a reasoning task ☆147 · Updated 8 months ago
- Data for evaluating gender bias in coreference resolution systems ☆77 · Updated 6 years ago
- ☆120 · Updated 10 months ago
- A collection of different ways to implement accessing and modifying internal model activations in LLMs ☆18 · Updated 8 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆214 · Updated 9 months ago
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆100 · Updated 4 months ago
- Interpretability for sequence generation models 🐛 🔍 ☆425 · Updated 2 months ago
- A library for efficient patching and automatic circuit discovery ☆67 · Updated 2 months ago