dit7ya / awesome-ai-alignmentLinks

A curated list of awesome resources for Artificial Intelligence Alignment research

☆71

Alternatives and similar repositories for awesome-ai-alignment

Users that are interested in awesome-ai-alignment are comparing it to the libraries listed below

Sorting:

EleutherAI / elk
Keeping language models honest by directly eliciting knowledge encoded in their activations.
☆209Updated last week
apartresearch / interpretability-starter
🧠 Starter templates for doing interpretability research
☆73Updated 2 years ago
meg-tong / sycophancy-eval
datasets from the paper "Towards Understanding Sycophancy in Language Models"
☆86Updated last year
moirage / alignment-research-dataset
A dataset of alignment research and code to reproduce it
☆77Updated 2 years ago
aypan17 / machiavelli
☆137Updated 2 weeks ago
collin-burns / discovering_latent_knowledge
☆274Updated last year
centerforaisafety / Intro_to_ML_Safety
☆73Updated 2 years ago
likenneth / othello_world
Emergent world representations: Exploring a sequence model trained on a synthetic task
☆186Updated 2 years ago
anthropics / evals
☆287Updated last year
timaeus-research / devinterp
Tools for studying developmental interpretability in neural networks.
☆100Updated last month
Sea-Snell / grokking
unofficial re-implementation of "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets"
☆78Updated 3 years ago
google-deepmind / dangerous-capability-evaluations
☆55Updated last week
normster / llm_rules
RuLES: a benchmark for evaluating rule-following in language models
☆228Updated 5 months ago
callummcdougall / ARENA_2.0
Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL.
☆220Updated last year
thestephencasper / everything-you-need
we got you bro
☆36Updated last year
srush / GPTWorld
A puzzle to learn about prompting
☆132Updated 2 years ago
EleutherAI / concept-erasure
Erasing concepts from neural representations with provable guarantees
☆232Updated 6 months ago
AlignmentResearch / tuned-lens
Tools for understanding how transformer predictions are built layer-by-layer
☆512Updated last year
bilal-chughtai / rep-theory-mech-interp
☆26Updated 2 years ago
AsaCooperStickland / situational-awareness-evals
Measuring the situational awareness of language models
☆37Updated last year
nostalgebraist / transformer-utils
Utilities for the HuggingFace transformers library
☆70Updated 2 years ago
ArthurConmy / Automatic-Circuit-Discovery
☆234Updated 10 months ago
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆72Updated last year
stanford-crfm / ecosystem-graphs
☆267Updated 6 months ago
lukasberglund / reversal_curse
☆291Updated last year
HannahKirk / prism-alignment
The Prism Alignment Project
☆79Updated last year
sangmichaelxie / cs324_p2
Project 2 (Building Large Language Models) for Stanford CS324: Understanding and Developing Large Language Models (Winter 2022)
☆105Updated 2 years ago
anthropics / sleeper-agents-paper
Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
☆111Updated last year
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆272Updated 7 months ago
tomekkorbak / pretraining-with-human-feedback
Code accompanying the paper Pretraining Language Models with Human Preferences
☆182Updated last year