jkutaso/SHADE-Arena

☆57

Alternatives and similar repositories for SHADE-Arena

Users who are interested in SHADE-Arena are comparing it to the repositories listed below.

safety-research / SHADE-Arena
☆26 · Jun 22, 2025 · Updated last year
scaleapi / mrt
https://scale.com/research/mrt
☆20 · Mar 16, 2026 · Updated 4 months ago
UKGovernmentBEIS / control-arena
ControlArena is a collection of settings, model organisms and protocols - for running control experiments.
☆213 · Updated this week
ASTRAL-Group / MonitorBench
[COLM 2026] Official implementation for "MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Mo…
☆20 · Apr 23, 2026 · Updated 3 months ago
jplhughes / dotfiles
Easily deploy my zsh and tmux configuration on new machines. Includes local and remote aliases to improve workflow.
☆15 · Apr 23, 2026 · Updated 3 months ago
TanqiuJiang / AgentLAB
The official implementation of the paper "AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks"
☆27 · Jun 1, 2026 · Updated last month
KihoPark / linear_rep_geometry
Code for 'The Linear Representation Hypothesis and the Geometry of Large Language Models' (ICML 2024)
☆125 · Feb 11, 2025 · Updated last year
safety-research / false-facts
☆50 · Jul 4, 2025 · Updated last year
OSU-NLP-Group / RedTeamCUA
[ICLR'26 Oral] RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
☆57 · Feb 9, 2026 · Updated 5 months ago
clarifying-EM / model-organisms-for-EM
Code repo for the model organisms and convergent directions of EM papers.
☆72 · Sep 22, 2025 · Updated 10 months ago
ApolloResearch / deception-detection
☆44 · Feb 11, 2025 · Updated last year
UCSB-AI / MSSBench
[ICLR 2025] Official codebase for the ICLR 2025 paper "Multimodal Situational Safety"
☆36 · Jun 23, 2025 · Updated last year
YuejiangLIU / csl
Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts
☆15 · Feb 26, 2024 · Updated 2 years ago
neelnanda-io / Neuroscope
Accompanying codebase for neuroscope.io, a website for displaying max activating dataset examples for language model neurons
☆14 · Feb 13, 2023 · Updated 3 years ago
thu-coai / Agent-SafetyBench
☆149 · Aug 11, 2025 · Updated 11 months ago
QingyuLiu / Agentic-Upward-Deception
Official implementation of “Are Your Agents Upward Deceivers?”, accepted at ICML 2026.
☆24 · Dec 15, 2025 · Updated 7 months ago
agarwalishika / DELIFT
☆16 · Feb 21, 2025 · Updated last year
zmsn-2077 / CUP-safe-rl
NeurIPS 2022: Constrained Update Projection Approach to Safe Policy Optimization
☆13 · Apr 10, 2023 · Updated 3 years ago
rgreenblatt / control-evaluations
☆25 · May 25, 2024 · Updated 2 years ago
aboustati / vargrad
Code accompanying VarGrad: A Low-Variance Gradient Estimator for Variational Inference
☆12 · Oct 12, 2020 · Updated 5 years ago
safety-research / auditing-agents
☆28 · Jul 1, 2026 · Updated 3 weeks ago
arnab-api / romba
Applies ROME and MEMIT on Mamba-S4 models
☆16 · Apr 5, 2024 · Updated 2 years ago
richardzhuang0412 / EmbedLLM
Repo for EmbedLLM: Learning Compact Representations of Large Language Models
☆32 · Sep 25, 2025 · Updated 9 months ago
chanind / claude-auto-research-synthsaebench
☆23 · Mar 11, 2026 · Updated 4 months ago
lalalamdbf / PLSE_IDRR
The Code for the EMNLP 2023 main conference paper "Prompt-based Logical Semantics Enhancement for Implicit Discourse Relation Recognition…
☆13 · Dec 10, 2023 · Updated 2 years ago
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆105 · Sep 21, 2023 · Updated 2 years ago
sun-wendy / DafnyBench
DafnyBench: A Benchmark for Formal Software Verification
☆67 · Dec 12, 2024 · Updated last year
iwhwang / SelecMix
SelecMix: Debiased Learning by Contradicting-pair Sampling (NeurIPS 2022)
☆13 · Jun 5, 2024 · Updated 2 years ago
PKU-Alignment / eval-anything
☆22 · Jul 26, 2025 · Updated 11 months ago
davidbau / sidn-handbook
The Structure and Interpretation of Deep Networks Handbook
☆14 · Dec 14, 2024 · Updated last year
xhan77 / in-context-alignment
In-Context Alignment: Chat with Vanilla Language Models Before Fine-Tuning
☆34 · Aug 9, 2023 · Updated 2 years ago
UKGovernmentBEIS / as-evaluation-standard
A repository that holds templates, examples, and tests to help external parties submit tasks to AISI that conform with the Autonomous Sys…
☆11 · Jan 16, 2026 · Updated 6 months ago
ryoungj / ToolEmu
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
☆214 · Mar 22, 2024 · Updated 2 years ago
SalesforceAIResearch / BingoGuard
☆15 · Jun 2, 2026 · Updated last month
cywinski / eliciting-secret-knowledge
Code repository for "Eliciting Secret Knowledge from Language Models"
☆23 · Mar 30, 2026 · Updated 3 months ago
annahdo / implementing_activation_steering
A collection of different ways to implement accessing and modifying internal model activations for LLMs
☆24 · Oct 18, 2024 · Updated last year
safety-research / safety-examples
☆31 · Nov 11, 2025 · Updated 8 months ago
safety-research / believe-it-or-not
Code and data for editing model beliefs with SDF and other methods, and for evaluating the depth of the implanted beliefs.
☆16 · Oct 23, 2025 · Updated 9 months ago
JHU-CLSP / rockfish-tutorial
☆10 · Mar 5, 2023 · Updated 3 years ago