anthropic-experimental / agentic-misalignment

☆562

Alternatives and similar repositories for agentic-misalignment

Users who are interested in agentic-misalignment are comparing it to the repositories listed below.

safety-research / petri
An alignment auditing agent capable of quickly exploring alignment hypotheses
☆874 Updated last week
anthropic-experimental / automated-auditing
Prompts used in the Automated Auditing Blog Post
☆137 Updated 6 months ago
emergent-misalignment / emergent-misalignment
☆259 Updated 3 weeks ago
UKGovernmentBEIS / inspect_evals
Collection of evals for Inspect AI
☆357 Updated this week
laude-institute / harbor
Harbor is a framework for running agent evaluations and creating and using RL environments.
☆542 Updated this week
arcprize / arc-agi-benchmarking
Testing baseline LLM performance across various models
☆336 Updated this week
haizelabs / verdict
Inference-time scaling for LLMs-as-a-judge.
☆328 Updated 3 months ago
chroma-core / context-rot
This repository contains the toolkit for replicating results from our technical report.
☆200 Updated 5 months ago
goodfire-ai / r1-interpretability
Open source interpretability artefacts for R1.
☆170 Updated 9 months ago
hijohnnylin / neuronpedia
Open source interpretability platform 🧠
☆689 Updated this week
obalcells / hallucination_probes
Real-Time Detection of Hallucinated Entities in Long-Form Generation
☆278 Updated 2 months ago
METR / vivaria
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
☆133 Updated last week
haizelabs / dspy-redteam
Red-Teaming Language Models with DSPy
☆250 Updated 11 months ago
haizelabs / Awesome-LLM-Judges
⚖️ Awesome LLM Judges ⚖️
☆161 Updated 9 months ago
PrimeIntellect-ai / community-environments
Lightly-reviewed collection of community environments
☆210 Updated last week
SWE-bench / SWE-smith
[NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents
☆538 Updated this week
princeton-pli / hal-harness
☆223 Updated this week
aw31 / openai-imo-2025-proofs
☆483 Updated 6 months ago
TheAgentCompany / TheAgentCompany
An agent benchmark with tasks in a simulated software company.
☆635 Updated 2 months ago
NousResearch / atropos
Atropos is a Language Model Reinforcement Learning Environments framework for collecting and evaluating LLM trajectories through diverse …
☆853 Updated this week
alexzhang13 / rlm-minimal
Super basic implementation (gist-like) of RLMs with REPL environments.
☆592 Updated last month
TransluceAI / observatory
A toolkit for describing model features and intervening on those features to steer behavior.
☆228 Updated last month
METR / task-standard
METR Task Standard
☆173 Updated last year
METR / eval-analysis-public
Public repository containing METR's DVC pipeline for eval data analysis
☆199 Updated last week
SakanaAI / treequest
A Tree Search Library with Flexible API for LLM Inference-Time Scaling
☆515 Updated 2 months ago
safety-research / persona_vectors
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
☆348 Updated 6 months ago
MinhxLe / subliminal-learning
☆107 Updated last week
simple-bench / SimpleBench
☆190 Updated last year
anthropics / sleeper-agents-paper
Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
☆127 Updated last year
huggingface / gpt-oss-recipes
Collection of scripts and notebooks for OpenAI's latest GPT OSS models
☆496 Updated 5 months ago