hendrycks / ethics
Aligning AI With Shared Human Values (ICLR 2021)
☆286 Updated 2 years ago
Alternatives and similar repositories for ethics
Users who are interested in ethics compare it to the libraries listed below
- ☆223 Updated 7 months ago
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆92 Updated last year
- Mechanistic Interpretability Visualizations using React ☆245 Updated 5 months ago
- datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆75 Updated last year
- ☆132 Updated 6 months ago
- Repository for research in the field of Responsible NLP at Meta. ☆199 Updated 5 months ago
- Steering Llama 2 with Contrastive Activation Addition ☆151 Updated 11 months ago
- Utilities for the HuggingFace transformers library ☆67 Updated 2 years ago
- ☆206 Updated last year
- StereoSet: Measuring stereotypical bias in pretrained language models ☆181 Updated 2 years ago
- ☆269 Updated 10 months ago
- Materials for EACL2024 tutorial: Transformer-specific Interpretability ☆53 Updated last year
- Dataset associated with "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation" paper ☆80 Updated 4 years ago
- The official code of LM-Debugger, an interactive tool for inspection and intervention in transformer-based language models. ☆177 Updated 3 years ago
- ☆114 Updated 9 months ago
- Tools for understanding how transformer predictions are built layer-by-layer ☆489 Updated 11 months ago
- Interpretability for sequence generation models 🐛 🔍 ☆414 Updated 3 weeks ago
- ControlArena is a suite of realistic settings, mimicking complex deployment environments, for running control evaluations. This is an alp… ☆57 Updated this week
- ☆287 Updated 2 weeks ago
- Keeping language models honest by directly eliciting knowledge encoded in their activations. ☆202 Updated last week
- ☆106 Updated last year
- Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery" ☆31 Updated 11 months ago
- Sparse probing paper full code. ☆56 Updated last year
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆201 Updated 5 months ago
- ☆206 Updated 4 years ago
- Erasing concepts from neural representations with provable guarantees ☆228 Updated 3 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆203 Updated 7 months ago
- Training data extraction on GPT-2 ☆186 Updated 2 years ago
- A library for finding knowledge neurons in pretrained transformer models. ☆157 Updated 3 years ago
- Using sparse coding to find distributed representations used by neural networks. ☆242 Updated last year