EffiSciencesResearch / ML4G

Machine Learning for Alignment Bootcamp

☆25

Alternatives and similar repositories for ML4G

Users who are interested in ML4G compare it to the repositories listed below.

timaeus-research / devinterp
Tools for studying developmental interpretability in neural networks.
☆114 Updated 5 months ago
apartresearch / interpretability-starter
🧠 Starter templates for doing interpretability research
☆75 Updated 2 years ago
thestephencasper / everything-you-need
we got you bro
☆36 Updated last year
callummcdougall / ARENA_2.0
Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL.
☆232 Updated 3 months ago
redwoodresearch / mlab
Machine Learning for Alignment Bootcamp
☆81 Updated 3 years ago
EleutherAI / elk
Keeping language models honest by directly eliciting knowledge encoded in their activations.
☆214 Updated last week
redwoodresearch / remix_public
☆20 Updated 2 years ago
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆302 Updated 11 months ago
Mech-Interp / PySvelte
A library for bridging Python and HTML/Javascript (via Svelte) for creating interactive visualizations
☆14 Updated last year
annahdo / implementing_activation_steering
A collection of different ways to implement accessing and modifying internal model activations for LLMs
☆19 Updated last year
EleutherAI / concept-erasure
Erasing concepts from neural representations with provable guarantees
☆239 Updated 10 months ago
UKGovernmentBEIS / control-arena
ControlArena is a collection of settings, model organisms and protocols - for running control experiments.
☆129 Updated this week
moirage / alignment-research-dataset
A dataset of alignment research and code to reproduce it
☆78 Updated 2 years ago
METR / task-standard
METR Task Standard
☆168 Updated 9 months ago
LRudL / evalugator
(Model-written) LLM evals library
☆18 Updated 11 months ago
mishajw / repeng
Experiments with representation engineering
☆13 Updated last year
collin-burns / discovering_latent_knowledge
☆283 Updated last year
TomFrederik / unseal
Mechanistic Interpretability for Transformer Models
☆53 Updated 3 years ago
yash-srivastava19 / arrakis
Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
☆31 Updated 7 months ago
anthropics / toy-models-of-superposition
Notebooks accompanying Anthropic's "Toy Models of Superposition" paper
☆130 Updated 3 years ago
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆228 Updated 11 months ago
MinhxLe / subliminal-learning
☆104 Updated 3 months ago
ApolloResearch / apd
Attribution-based Parameter Decomposition
☆32 Updated 5 months ago
ArthurConmy / Automatic-Circuit-Discovery
☆255 Updated last year
jessicarumbelow / Backwards
☆85 Updated last year
bilal-chughtai / rep-theory-mech-interp
☆27 Updated 2 years ago
danielmamay / mlab
Machine Learning for Alignment Bootcamp (MLAB).
☆30 Updated 3 years ago
callummcdougall / sae_visualizer
☆29 Updated last year
anthropics / evals
☆313 Updated last year
neelnanda-io / Grokking
A Mechanistic Interpretability Analysis of Grokking
☆23 Updated 3 years ago