TeunvdWeij/sandbagging

TeunvdWeij / sandbagging

☆20

Alternatives and similar repositories for sandbagging

Users interested in sandbagging are comparing it to the repositories listed below.

rgreenblatt / model_organism_public
☆15 · Jun 17, 2025 · Updated last year
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆33 · May 23, 2024 · Updated 2 years ago
ApolloResearch / deception-detection
☆44 · Feb 11, 2025 · Updated last year
alan-cooney / transformer-lens-starter-template
A quick way to get started with Transformer Lens
☆14 · Dec 13, 2023 · Updated 2 years ago
ejnnr / cupbearer
A library for mechanistic anomaly detection
☆22 · Jan 9, 2025 · Updated last year
alan-cooney / transformer-from-scratch
Decoder only transformer, built from scratch with PyTorch
☆33 · Oct 22, 2023 · Updated 2 years ago
ApolloResearch / sample
Repository with sample code using Apollo's suggested engineering practices
☆15 · Dec 16, 2024 · Updated last year
LukeBailey181 / obfuscated-activations
Codebase for Obfuscated Activations Bypass LLM Latent-Space Defenses
☆31 · Feb 11, 2025 · Updated last year
ajobi-uhc / seer
This was designed for interp researchers who want to do research on or with interp agents to give quality of life improvements and fix …
☆146 · Feb 8, 2026 · Updated 5 months ago
thestephencasper / latent_adversarial_training
☆24 · Jul 25, 2024 · Updated 2 years ago
AsaCooperStickland / situational-awareness-evals
Measuring the situational awareness of language models
☆41 · Feb 12, 2024 · Updated 2 years ago
simple-stories / simple_stories_train
Trains small LMs. Designed for training on SimpleStories
☆14 · Sep 15, 2025 · Updated 10 months ago
FlyingPumba / InterpBench
A benchmark for mechanistic discovery of circuits in Transformers
☆17 · Dec 15, 2024 · Updated last year
yc015 / TalkTuner-chatbot-llm-dashboard
Designing a Dashboard for Transparency and Control of Conversational AI, https://arxiv.org/abs/2406.07882
☆39 · Oct 7, 2025 · Updated 9 months ago
callummcdougall / sae_visualizer
☆31 · Apr 4, 2024 · Updated 2 years ago
DanielPolatajko / inspect_wandb
Integration between Inspect and Weights & Biases
☆24 · Updated this week
ag8 / sha-transformer
☆12 · Jul 8, 2024 · Updated 2 years ago
jcmgray / einsum_bmm
einsum via batch matrix multiply
☆15 · Nov 29, 2023 · Updated 2 years ago
EleutherAI / deep-ignorance
☆20 · Jan 7, 2026 · Updated 6 months ago
longtermrisk / openweights
A python sdk for LLM finetuning and inference on runpod infrastructure
☆30 · May 12, 2026 · Updated 2 months ago
Blkalkin / Optimal-TestTime
☆10 · Mar 24, 2025 · Updated last year
noanabeshima / tinymodel
A TinyStories LM with SAEs and transcoders
☆14 · Apr 3, 2025 · Updated last year
tim-hua-01 / steering-eval-awareness-public
☆17 · Mar 16, 2026 · Updated 4 months ago
explanare / ravel
Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
☆58 · Oct 30, 2025 · Updated 8 months ago
amack315 / unsupervised-steering-vectors
☆38 · Apr 30, 2024 · Updated 2 years ago
wicai24 / DOOR-Alignment
☆20 · Apr 7, 2025 · Updated last year
RobertCsordas / onion_representations
☆13 · Aug 19, 2024 · Updated last year
EleutherAI / training-jacobian
☆24 · Dec 11, 2024 · Updated last year
IINemo / llm-uncertainty-head
☆26 · Feb 23, 2026 · Updated 5 months ago
Jiaxin-Wen / MisleadLM
Official Code for our paper: "Language Models Learn to Mislead Humans via RLHF"
☆20 · Oct 11, 2024 · Updated last year
safety-research / open-source-alignment-faking
Open Source Replication of Anthropic's Alignment Faking Paper
☆58 · Apr 4, 2025 · Updated last year
XuchanBao / behavioral-self-awareness
☆37 · Feb 20, 2025 · Updated last year
cadentj / caft
☆25 · Mar 30, 2026 · Updated 3 months ago
choidami / inductive-oocr
☆16 · Mar 22, 2025 · Updated last year
zhu-minjun / SafetyLock
Your finetuned model's back to its original safety standards faster than you can say "SafetyLock"!
☆11 · Oct 16, 2024 · Updated last year
y-z-zhang / SBD
A simple algorithm that finds a simultaneous block diagonalization of multiple matrices through the eigendecomposition of a single matrix…
☆16 · Feb 24, 2026 · Updated 5 months ago
aengusl / latent-adversarial-training
☆48 · Sep 29, 2024 · Updated last year
JoshEngels / MultiDimensionalFeatures
Code for reproducing our paper "Not All Language Model Features Are Linear"
☆90 · Nov 27, 2024 · Updated last year
Zhou-Zoey / RMB-Reward-Model-Benchmark
☆48 · Mar 25, 2025 · Updated last year