danielmamay / mlab

Machine Learning for Alignment Bootcamp (MLAB).

☆30

Alternatives and similar repositories for mlab

Users who are interested in mlab are comparing it to the repositories listed below.

redwoodresearch / mlab
Machine Learning for Alignment Bootcamp
☆79 · Updated 3 years ago
callummcdougall / ARENA_2.0
Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL.
☆228 · Updated 2 months ago
apartresearch / interpretability-starter
🧠 Starter templates for doing interpretability research
☆75 · Updated 2 years ago
timaeus-research / devinterp
Tools for studying developmental interpretability in neural networks.
☆109 · Updated 3 months ago
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆291 · Updated 10 months ago
callummcdougall / ARENA_3.0
☆747 · Updated 2 weeks ago
thestephencasper / everything-you-need
we got you bro
☆36 · Updated last year
EffiSciencesResearch / ML4G
Machine Learning for Alignment Bootcamp
☆26 · Updated last year
ArthurConmy / Automatic-Circuit-Discovery
☆244 · Updated last year
EleutherAI / elk
Keeping language models honest by directly eliciting knowledge encoded in their activations.
☆208 · Updated last week
UKGovernmentBEIS / control-arena
ControlArena is a collection of settings, model organisms and protocols - for running control experiments.
☆100 · Updated last week
dit7ya / awesome-ai-alignment
A curated list of awesome resources for Artificial Intelligence Alignment research
☆71 · Updated 2 years ago
METR / task-standard
METR Task Standard
☆163 · Updated 8 months ago
collin-burns / discovering_latent_knowledge
☆278 · Updated last year
srush / Transformer-Puzzles
Puzzles for exploring transformers
☆371 · Updated 2 years ago
ndif-team / nnsight
The nnsight package enables interpreting and manipulating the internals of deep learned models.
☆683 · Updated this week
redwoodresearch / remix_public
☆19 · Updated 2 years ago
anthropics / evals
☆304 · Updated last year
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆221 · Updated 10 months ago
annahdo / implementing_activation_steering
A collection of different ways to implement accessing and modifying internal model activations for LLMs
☆19 · Updated last year
safety-research / safety-tooling
Inference API for many LLMs and other useful tools for empirical research
☆77 · Updated 2 weeks ago
ARBORproject / arborproject.github.io
☆81 · Updated 7 months ago
METR / public-tasks
☆104 · Updated this week
LRudL / evalugator
(Model-written) LLM evals library
☆18 · Updated 10 months ago
anthropics / toy-models-of-superposition
Notebooks accompanying Anthropic's "Toy Models of Superposition" paper
☆129 · Updated 3 years ago
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆272 · Updated last year
UlisseMini / procgen-tools
Tools for running experiments on RL agents in procgen environments
☆19 · Updated last year
ruizheliUOA / Awesome-Interpretability-in-Large-Language-Models
This repository collects all relevant resources about interpretability in LLMs
☆374 · Updated 11 months ago
neelnanda-io / 1L-Sparse-Autoencoder
☆128 · Updated last year
mishajw / repeng
Experiments with representation engineering
☆13 · Updated last year