safety-research/SHADE-Arena

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/safety-research/SHADE-Arena)

safety-research / SHADE-Arena

☆26

Alternatives and similar repositories for SHADE-Arena

Users that are interested in SHADE-Arena are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

jkutaso / SHADE-Arena
View on GitHub
☆57May 9, 2025Updated last year
jplhughes / dotfiles
View on GitHub
Easily deploy my zsh and tmux configuration on new machines. Includes local and remote aliases to improve workflow.
☆15Apr 23, 2026Updated 3 months ago
Jiaxin-Wen / Unsupervised-Elicitation
View on GitHub
☆41Jul 6, 2025Updated last year
safety-research / open-source-alignment-faking
View on GitHub
Open Source Replication of Anthropic's Alignment Faking Paper
☆58Apr 4, 2025Updated last year
Beneficial-AI-Foundation / dafny-autopilot
View on GitHub
☆14Jun 11, 2025Updated last year
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
YuejiangLIU / csl
View on GitHub
Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts
☆15Feb 26, 2024Updated 2 years ago
zmsn-2077 / CUP-safe-rl
View on GitHub
NeurIPS2022: Constrained Update Projection Approach to Safe Policy Optimization
☆13Apr 10, 2023Updated 3 years ago
aboustati / vargrad
View on GitHub
Code accompanying VarGrad: A Low-Variance Gradient Estimator for Variational Inference
☆12Oct 12, 2020Updated 5 years ago
jxnl / multi-agent-researcher
View on GitHub
Multi-agent research system using Instructor for structured LLM outputs and Exa.ai for neural search
☆17Jun 15, 2025Updated last year
rgreenblatt / model_organism_public
View on GitHub
☆15Jun 17, 2025Updated last year
JoshEngels / SAE-Dark-Matter
View on GitHub
Code for our paper "Decomposing The Dark Matter of Sparse Autoencoders"
☆23Feb 6, 2025Updated last year
sun-wendy / DafnyBench
View on GitHub
DafnyBench: A Benchmark for Formal Software Verification
☆67Dec 12, 2024Updated last year
PKU-Alignment / eval-anything
View on GitHub
☆22Jul 26, 2025Updated 11 months ago
xhan77 / in-context-alignment
View on GitHub
In-Context Alignment: Chat with Vanilla Language Models Before Fine-Tuning
☆34Aug 9, 2023Updated 2 years ago
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
KihoPark / linear_rep_geometry
View on GitHub
Code for 'The Linear Representation Hypothesis and the Geometry of Large Language Models' (ICML 2024)
☆125Feb 11, 2025Updated last year
ordavid-s / snmf-mlp-decomposition
View on GitHub
☆15Jul 7, 2026Updated 2 weeks ago
EleutherAI / deep-ignorance
View on GitHub
☆20Jan 7, 2026Updated 6 months ago
antra-tess / chapterx
View on GitHub
☆26Jul 1, 2026Updated 3 weeks ago
ybisk / CCG-Induction
View on GitHub
Unsupervised Grammar Induction with Combinatory Categorial Grammars
☆10Jan 28, 2021Updated 5 years ago
marcus-jw / Targeted-Manipulation-and-Deception-in-LLMs
View on GitHub
Codebase for "On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback". This repo implements a generative multi-tur…
☆25Dec 3, 2024Updated last year
Harvard-CS-2881 / harvard-cs-2881-hw0
View on GitHub
harvard-cs-2881-classroom-hw0-c2881-hw0 created by GitHub Classroom
☆16Jul 26, 2025Updated 11 months ago
dylanmeysmans / FSharp.Data.Tdms
View on GitHub
TDMS 2.0 support for F# and C#
☆13Dec 26, 2022Updated 3 years ago
gautierdag / plancraft
View on GitHub
Plancraft is a minecraft environment and agent suite to test planning capabilities in LLMs
☆30Nov 7, 2025Updated 8 months ago
Managed Database hosting by DigitalOcean • Ad
PostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
anthropic-experimental / agentic-misalignment
View on GitHub
☆643Jun 19, 2025Updated last year
HanjiangHu / NBF-LLM
View on GitHub
The official code for "Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks".
☆18Jun 24, 2026Updated last month
DavisPL / PCCC
View on GitHub
Proof-carrying code completions in Dafny
☆11Apr 4, 2025Updated last year
allenai / understanding_mcqa
View on GitHub
Code for the arXiv preprint "Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions"
☆15Aug 2, 2025Updated 11 months ago
Shentao-YANG / Dense_Reward_T2I
View on GitHub
Source code for "A Dense Reward View on Aligning Text-to-Image Diffusion with Preference" (ICML'24).
☆39May 9, 2024Updated 2 years ago
Cadenza-Labs / sleeper-agents
View on GitHub
☆15Jul 12, 2024Updated 2 years ago
LAVA-LAB / safe-slac
View on GitHub
Safe SLAC, an algorithm for safe cost-constrained reinforcement learning in high-dimensional POMDPs.
☆11Mar 1, 2023Updated 3 years ago
PKU-Alignment / aligner
View on GitHub
[NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct
☆194Jan 16, 2025Updated last year
harvardnlp / strux
View on GitHub
☆18Mar 20, 2022Updated 4 years ago
Managed Database hosting by DigitalOcean • Ad
PostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
mlubin / jlSimplex
View on GitHub
Proof-of-concept implementation of the (dual) simplex algorithm for linear programming in Julia.
☆40Mar 24, 2013Updated 13 years ago
TeunvdWeij / sandbagging
View on GitHub
☆20Nov 15, 2024Updated last year
thu-ml / MLA-Trust
View on GitHub
A toolbox for benchmarking Multimodal LLM Agents trustworthiness across truthfulness, controllability, safety and privacy dimensions thro…
☆63Jan 9, 2026Updated 6 months ago
agarwalishika / DELIFT
View on GitHub
☆16Feb 21, 2025Updated last year
tianyu139 / tangent-model-composition
View on GitHub
Code for Tangent Model Composition for Ensembling and Continual Fine-tuning (ICCV 2023) and Tangent Transformers for Composition, Privacy…
☆14May 14, 2024Updated 2 years ago
kid-yang233 / robots
View on GitHub
The homework of robos learning base.
☆11May 23, 2023Updated 3 years ago
ethz-spylab / rlhf_trojan_competition
View on GitHub
Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.
☆119Jun 13, 2024Updated 2 years ago