CentreSecuriteIA / BELLS

Benchmarks for the Evaluation of LLM Supervision

☆32

Alternatives and similar repositories for BELLS

Users who are interested in BELLS are comparing it to the libraries listed below.

UKGovernmentBEIS / control-arena
ControlArena is a collection of settings, model organisms, and protocols for running control experiments.
☆76Updated this week
UKGovernmentBEIS / inspect_evals
Collection of evals for Inspect AI
☆178Updated last week
METR / task-standard
METR Task Standard
☆154Updated 5 months ago
UKGovernmentBEIS / inspect_ai
Inspect: A framework for large language model evaluations
☆1,145Updated this week
ruizheliUOA / Awesome-Interpretability-in-Large-Language-Models
This repository collects all relevant resources about interpretability in LLMs
☆363Updated 8 months ago
anthropics / sleeper-agents-paper
Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
☆109Updated last year
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆262Updated 7 months ago
guidelabs / infembed
Find the samples, in the test data, on which your (generative) model makes mistakes.
☆28Updated 9 months ago
centerforaisafety / Intro_to_ML_Safety
☆72Updated 2 years ago
stanfordnlp / pyvene
Stanford NLP Python library for understanding and improving PyTorch models via interventions
☆770Updated this week
google-deepmind / concordia
A library for generative social simulation
☆936Updated this week
haizelabs / redteaming-resistance-benchmark
☆45Updated 11 months ago
callummcdougall / ARENA_3.0
☆617Updated last week
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆206Updated 7 months ago
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆256Updated last year
rgreenblatt / control-evaluations
☆15Updated last year
compl-ai / compl-ai
An open-source compliance-centered evaluation framework for Generative AI models
☆158Updated this week
timaeus-research / devinterp
Tools for studying developmental interpretability in neural networks.
☆99Updated 3 weeks ago
JasonGross / guarantees-based-mechanistic-interpretability
☆14Updated 2 weeks ago
AgentTorch / AgentTorch
Large population models
☆378Updated last week
longtermrisk / openweights
A Python SDK for LLM fine-tuning and inference on RunPod infrastructure
☆11Updated 2 weeks ago
callummcdougall / ARENA_2.0
Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL.
☆217Updated last year
apartresearch / interpretability-starter
🧠 Starter templates for doing interpretability research
☆72Updated 2 years ago
steering-vectors / steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
☆115Updated 4 months ago
ndif-team / nnsight
The nnsight package enables interpreting and manipulating the internals of deep learned models.
☆608Updated last week
Giskard-AI / awesome-ai-safety
📚 A curated list of papers & technical articles on AI Quality & Safety
☆188Updated 3 months ago
saprmarks / dictionary_learning
☆315Updated this week
TransluceAI / observatory
A toolkit for describing model features and intervening on those features to steer behavior.
☆193Updated 8 months ago
google-deepmind / mishax
☆134Updated 3 months ago
safety-research / safety-tooling
Inference API for many LLMs and other useful tools for empirical research
☆52Updated this week