hendrycks / ethics

Aligning AI With Shared Human Values (ICLR 2021)

☆302

Alternatives and similar repositories for ethics

Users who are interested in ethics are comparing it to the repositories listed below.

facebookresearch / ResponsibleNLP
Repository for research in the field of Responsible NLP at Meta.
☆202 Updated 5 months ago
amazon-science / bold
Dataset associated with "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation" paper
☆81 Updated 4 years ago
tatsu-lab / opinions_qa
☆116 Updated last year
moinnadeem / StereoSet
StereoSet: Measuring stereotypical bias in pretrained language models
☆191 Updated 2 years ago
nyu-mll / BBQ
Repository for the Bias Benchmark for QA dataset.
☆129 Updated last year
anthropics / evals
☆305 Updated last year
allenai / real-toxicity-prompts
☆221 Updated 4 years ago
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆98 Updated 2 years ago
McGill-NLP / bias-bench
ACL 2022: An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models.
☆149 Updated 2 months ago
PAIR-code / interpretability
PAIR.withgoogle.com and friends' work on interpretability methods
☆207 Updated last month
mega002 / lm-debugger
The official code of LM-Debugger, an interactive tool for inspection and intervention in transformer-based language models.
☆179 Updated 3 years ago
krishnap25 / mauve
Package to compute Mauve, a similarity score between neural text and human text. Install with `pip install mauve-text`.
☆298 Updated last year
HannahKirk / prism-alignment
The Prism Alignment Project
☆83 Updated last year
timoschick / self-debiasing
This repository contains the code for "Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP".
☆88 Updated 4 years ago
evandez / REMEDI
Inspecting and Editing Knowledge Representations in Language Models
☆119 Updated 2 years ago
nostalgebraist / transformer-utils
Utilities for the HuggingFace transformers library
☆72 Updated 2 years ago
meg-tong / sycophancy-eval
Datasets from the paper "Towards Understanding Sycophancy in Language Models"
☆95 Updated 2 years ago
inseq-team / inseq
Interpretability for sequence generation models 🐛 🔍
☆443 Updated last month
microsoft / TOXIGEN
This repo contains the code for generating the ToxiGen dataset, published at ACL 2022.
☆335 Updated last year
ArthurConmy / Automatic-Circuit-Discovery
☆248 Updated last year
google-research / lm-extraction-benchmark
☆293 Updated 2 months ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆191 Updated last year
EleutherAI / knowledge-neurons
A library for finding knowledge neurons in pretrained transformer models.
☆159 Updated 3 years ago
anthropics / ConstitutionalHarmlessnessPaper
☆243 Updated 2 years ago
wesg52 / sparse-probing-paper
Sparse probing paper full code.
☆62 Updated last year
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆114 Updated last month
aypan17 / machiavelli
☆138 Updated 3 months ago
collin-burns / discovering_latent_knowledge
☆280 Updated last year
AlignmentResearch / tuned-lens
Tools for understanding how transformer predictions are built layer-by-layer
☆535 Updated 2 months ago
tonyzhaozh / few-shot-learning
Few-shot Learning of GPT-3
☆356 Updated 2 years ago