safety-research/safety-tooling

Inference API for many LLMs and other useful tools for empirical research

☆108

Alternatives and similar repositories for safety-tooling

Users who are interested in safety-tooling are comparing it to the libraries listed below.

safety-research / safety-examples
☆25 · Nov 11, 2025 · Updated 3 months ago
UKGovernmentBEIS / control-arena
ControlArena is a collection of settings, model organisms and protocols for running control experiments.
☆160 · Feb 27, 2026 · Updated last week
longtermrisk / openweights
A Python SDK for LLM finetuning and inference on RunPod infrastructure
☆20 · Feb 16, 2026 · Updated 3 weeks ago
rgreenblatt / control-evaluations
☆20 · May 25, 2024 · Updated last year
ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆64 · Oct 27, 2024 · Updated last year
dtch1997 / steering-bench
Official codebase for "Analyzing the Generalization and Reliability of Steering Vectors"
☆19 · Dec 14, 2024 · Updated last year
METR / vivaria
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
☆134 · Feb 15, 2026 · Updated 3 weeks ago
marcus-jw / Targeted-Manipulation-and-Deception-in-LLMs
Codebase for "On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback". This repo implements a generative multi-tur…
☆23 · Dec 3, 2024 · Updated last year
annahdo / implementing_activation_steering
A collection of different ways to implement accessing and modifying internal model activations for LLMs
☆20 · Oct 18, 2024 · Updated last year
raybears / cot-transparency
Improving transparency of large language models' reasoning
☆14 · Nov 25, 2025 · Updated 3 months ago
science-of-finetuning / sparsity-artifacts-crosscoders
Code for the "Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning" paper.
☆16 · Nov 21, 2025 · Updated 3 months ago
tim-lawson / mlsae
Multi-Layer Sparse Autoencoders (ICLR 2025)
☆29 · Feb 6, 2026 · Updated last month
redwoodresearch / mlab
Machine Learning for Alignment Bootcamp
☆82 · Apr 27, 2022 · Updated 3 years ago
cadentj / caft
☆24 · Oct 3, 2025 · Updated 5 months ago
ckkissane / sae-transfer
Code to reproduce key results accompanying "SAEs (usually) Transfer Between Base and Chat Models"
☆13 · Jul 18, 2024 · Updated last year
tianyu139 / tangent-model-composition
Code for Tangent Model Composition for Ensembling and Continual Fine-tuning (ICCV 2023) and Tangent Transformers for Composition, Privacy…
☆13 · May 14, 2024 · Updated last year
safety-research / SHADE-Arena
☆21 · Jun 22, 2025 · Updated 8 months ago
science-of-finetuning / diffing-toolkit
A toolkit that provides a range of model diffing techniques including a UI to visualize them interactively.
☆66 · Updated this week
ApolloResearch / apd
Attribution-based Parameter Decomposition
☆34 · Jun 11, 2025 · Updated 8 months ago
neilrathi / token-filtering
Shaping capabilities with token-level pretraining data filtering
☆84 · Jan 28, 2026 · Updated last month
hijohnnylin / automated-interpretability
☆22 · Feb 13, 2026 · Updated 3 weeks ago
noanabeshima / tinymodel
A TinyStories LM with SAEs and transcoders
☆14 · Apr 3, 2025 · Updated 11 months ago
lasgroup / SafetyPolytope
Learning Safety Constraints for Large Language Models (ICML 2025)
☆31 · Aug 4, 2025 · Updated 7 months ago
FlyingPumba / InterpBench
A benchmark for mechanistic discovery of circuits in Transformers
☆16 · Dec 15, 2024 · Updated last year
jkutaso / SHADE-Arena
☆36 · May 9, 2025 · Updated 10 months ago
safety-research / petri
An alignment auditing agent capable of quickly exploring alignment hypotheses
☆934 · Feb 28, 2026 · Updated last week
Aaquib111 / edge-attribution-patching
Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery"
☆47 · May 31, 2024 · Updated last year
jqhoogland / autoanki
Automatically create Anki cards from text using language models
☆20 · Jan 7, 2023 · Updated 3 years ago
ndif-team / nnsight
The nnsight package enables interpreting and manipulating the internals of deep learned models.
☆836 · Updated this week
jbloomAus / SAEDashboard
☆89 · Dec 18, 2025 · Updated 2 months ago
goodfire-ai / scribe
☆75 · Feb 18, 2026 · Updated 2 weeks ago
callummcdougall / ARENA_3.0
☆979 · Updated this week
nrimsky / InfluenceFunctions
Implementation of Influence Function approximations for differently sized ML models, using PyTorch
☆16 · Sep 15, 2023 · Updated 2 years ago
bluedotimpact / bluedot
✨ Monorepo containing most of BlueDot Impact's custom software.
☆24 · Mar 3, 2026 · Updated last week
JasonGross / guarantees-based-mechanistic-interpretability
☆18 · Feb 25, 2026 · Updated last week
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆91 · Dec 31, 2025 · Updated 2 months ago
Heidelberg-NLP / CC-SHAP
Code for "On Measuring Faithfulness of Natural Language Explanations"
☆21 · Jul 23, 2024 · Updated last year
koayon / atp_star
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
☆20 · Jan 19, 2025 · Updated last year
okarthikb / DPO
Implementation of Direct Preference Optimization
☆17 · Jul 17, 2023 · Updated 2 years ago