kutay25 / ai-safety-alignment-camps

A repository of 2025-2026 AI Safety and Alignment programs, camps, and workshops.

☆22

Alternatives and similar repositories for ai-safety-alignment-camps

Users who are interested in ai-safety-alignment-camps are comparing it to the repositories listed below.

callummcdougall / ARENA_3.0
☆646 Updated last week
callummcdougall / ARENA_2.0
Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL.
☆221 Updated last year
ndif-team / nnsight
The nnsight package enables interpreting and manipulating the internals of deep learned models.
☆622 Updated last week
curt-tigges / probity
☆15 Updated 4 months ago
jbloomAus / SAELens
Training Sparse Autoencoders on Language Models
☆900 Updated last week
saprmarks / dictionary_learning
☆326 Updated 3 weeks ago
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆273 Updated 7 months ago
ArthurConmy / Automatic-Circuit-Discovery
☆234 Updated 10 months ago
neelnanda-io / Crosscoders
☆51 Updated 8 months ago
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆257 Updated last year
delphi-suite / delphi
small language models training made easy
☆13 Updated 7 months ago
ApolloResearch / apd
Attribution-based Parameter Decomposition
☆28 Updated 2 months ago
EleutherAI / sparsify
Sparsify transformers with SAEs and transcoders
☆598 Updated last week
saprmarks / feature-circuits
☆183 Updated 3 weeks ago
ARBORproject / arborproject.github.io
☆81 Updated 5 months ago
science-of-finetuning / crosscoder_learning
Modified to support crosscoder training.
☆21 Updated 2 weeks ago
safety-research / safety-tooling
Inference API for many LLMs and other useful tools for empirical research
☆63 Updated this week
adamkarvonen / SAEBench
☆109 Updated 3 weeks ago
timaeus-research / devinterp
Tools for studying developmental interpretability in neural networks.
☆100 Updated last month
UKGovernmentBEIS / control-arena
ControlArena is a collection of settings, model organisms and protocols - for running control experiments.
☆82 Updated this week
ruizheliUOA / Awesome-Interpretability-in-Large-Language-Models
This repository collects all relevant resources about interpretability in LLMs
☆368 Updated 9 months ago
Butanium / nnterp
Unified access to Large Language Model modules using NNsight
☆38 Updated 2 weeks ago
redwoodresearch / mlab
Machine Learning for Alignment Bootcamp
☆76 Updated 3 years ago
Dakingrai / awesome-mechanistic-interpretability-lm-papers
☆180 Updated 8 months ago
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆202 Updated this week
HoagyC / sparse_coding
Using sparse coding to find distributed representations used by neural networks.
☆261 Updated last year
LRudL / evalugator
(Model-written) LLM evals library
☆18 Updated 7 months ago
jacobdunefsky / llm-steering-opt
Tools for optimizing steering vectors in LLMs.
☆11 Updated 4 months ago
jacobdunefsky / transcoder_circuits
☆157 Updated 8 months ago
apartresearch / interpretability-starter
🧠 Starter templates for doing interpretability research
☆73 Updated 2 years ago