safety-research / safety-examplesLinks

☆19

Alternatives and similar repositories for safety-examples

Users that are interested in safety-examples are comparing it to the libraries listed below

Sorting:

safety-research / safety-tooling
Inference API for many LLMs and other useful tools for empirical research
☆77Updated last week
amack315 / unsupervised-steering-vectors
☆36Updated last year
METR / task-standard
METR Task Standard
☆163Updated 8 months ago
noanabeshima / tinymodel
A TinyStories LM with SAEs and transcoders
☆13Updated 7 months ago
UKGovernmentBEIS / control-arena
ControlArena is a collection of settings, model organisms and protocols - for running control experiments.
☆115Updated this week
timaeus-research / devinterp
Tools for studying developmental interpretability in neural networks.
☆112Updated 4 months ago
annahdo / implementing_activation_steering
A collection of different ways to implement accessing and modifying internal model activations for LLMs
☆19Updated last year
callummcdougall / ARENA_2.0
Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL.
☆230Updated 2 months ago
ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆55Updated 6 months ago
ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆59Updated last year
mishajw / repeng
Experiments with representation engineering
☆13Updated last year
jbloomAus / SAEDashboard
☆77Updated 3 weeks ago
redwoodresearch / alignment_faking_public
☆79Updated 3 weeks ago
curt-tigges / probity
☆19Updated 6 months ago
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆296Updated 10 months ago
Butanium / tiny-activation-dashboard
A tiny easily hackable implementation of a feature dashboard.
☆15Updated last week
ApolloResearch / apd
Attribution-based Parameter Decomposition
☆31Updated 4 months ago
redwoodresearch / mlab
Machine Learning for Alignment Bootcamp
☆79Updated 3 years ago
google-deepmind / dangerous-capability-evaluations
☆61Updated last month
anthropics / toy-models-of-superposition
Notebooks accompanying Anthropic's "Toy Models of Superposition" paper
☆129Updated 3 years ago
callummcdougall / sae_visualizer
☆29Updated last year
neelnanda-io / Crosscoders
☆55Updated 11 months ago
ArthurConmy / Automatic-Circuit-Discovery
☆248Updated last year
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆224Updated 10 months ago
slavachalnev / SAE-TS
Improving Steering Vectors by Targeting Sparse Autoencoder Features
☆25Updated 11 months ago
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆78Updated 3 months ago
apartresearch / interpretability-starter
🧠 Starter templates for doing interpretability research
☆74Updated 2 years ago
Butanium / nnterp
Unified access to Large Language Model modules using NNsight
☆52Updated last week
oli-clive-griffin / crosscode
A library for training crosscoders
☆12Updated 5 months ago
rgreenblatt / control-evaluations
☆18Updated last year