callummcdougall / ARENA_2.0

Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL.

☆194

Related projects ⓘ

Alternatives and complementary repositories for ARENA_2.0

TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆195Updated 3 months ago
callummcdougall / ARENA_3.0
☆337Updated this week
ArthurConmy / Automatic-Circuit-Discovery
☆186Updated last month
ndif-team / nnsight
The nnsight package enables interpreting and manipulating the internals of deep learned models.
☆399Updated this week
timaeus-research / devinterp
Tools for studying developmental interpretability in neural networks.
☆74Updated this week
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆157Updated last month
jbloomAus / SAELens
Training Sparse Autoencoders on Language Models
☆449Updated this week
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆187Updated 3 months ago
METR / task-standard
METR Task Standard
☆122Updated last week
redwoodresearch / mlab
Machine Learning for Alignment Bootcamp
☆63Updated 2 years ago
saprmarks / feature-circuits
☆102Updated last month
saprmarks / dictionary_learning
☆141Updated 2 weeks ago
neelnanda-io / 1L-Sparse-Autoencoder
☆108Updated last year
HoagyC / sparse_coding
Using sparse coding to find distributed representations used by neural networks.
☆181Updated 11 months ago
redwoodresearch / Easy-Transformer
☆96Updated 3 months ago
EleutherAI / elk
Keeping language models honest by directly eliciting knowledge encoded in their activations.
☆186Updated this week
ruizheliUOA / Awesome-Interpretability-in-Large-Language-Models
This repository collects all relevant resources about interpretability in LLMs
☆282Updated last week
EleutherAI / sae
Sparse autoencoders
☆333Updated 2 weeks ago
collin-burns / discovering_latent_knowledge
☆252Updated 8 months ago
steering-vectors / steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
☆64Updated last month
anthropics / toy-models-of-superposition
Notebooks accompanying Anthropic's "Toy Models of Superposition" paper
☆95Updated 2 years ago
AlignmentResearch / tuned-lens
Tools for understanding how transformer predictions are built layer-by-layer
☆429Updated 5 months ago
jacobdunefsky / transcoder_circuits
☆43Updated 4 months ago
EleutherAI / sae-auto-interp
☆99Updated this week
justinchiu / openlogprobs
Extract full next-token probabilities via language model APIs
☆228Updated 8 months ago
danielmamay / mlab
Machine Learning for Alignment Bootcamp (MLAB).
☆22Updated 2 years ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆94Updated 5 months ago
TomFrederik / unseal
Mechanistic Interpretability for Transformer Models
☆49Updated 2 years ago
apartresearch / interpretability-starter
🧠 Starter templates for doing interpretability research
☆63Updated last year
likenneth / othello_world
Emergent world representations: Exploring a sequence model trained on a synthetic task
☆168Updated last year