anthropic-experimental / agentic-misalignment
☆562 Updated 7 months ago
Alternatives and similar repositories for agentic-misalignment
Users who are interested in agentic-misalignment are comparing it to the libraries listed below.
- Prompts used in the Automated Auditing Blog Post ☆137 Updated 6 months ago
- An alignment auditing agent capable of quickly exploring alignment hypotheses ☆874 Updated last week
- ☆259 Updated 3 weeks ago
- open source interpretability platform 🧠 ☆689 Updated this week
- Inference-time scaling for LLMs-as-a-judge. ☆328 Updated 3 months ago
- Collection of evals for Inspect AI ☆357 Updated this week
- Harbor is a framework for running agent evaluations and creating and using RL environments. ☆542 Updated this week
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆133 Updated last week
- Open source interpretability artefacts for R1. ☆170 Updated 9 months ago
- ⚖️ Awesome LLM Judges ⚖️ ☆161 Updated 9 months ago
- ☆223 Updated this week
- Real-Time Detection of Hallucinated Entities in Long-Form Generation ☆278 Updated 2 months ago
- Lightly-reviewed collection of community environments ☆210 Updated last week
- Testing baseline LLM performance across various models ☆336 Updated this week
- ControlArena is a collection of settings, model organisms and protocols for running control experiments. ☆153 Updated this week
- Atropos is a Language Model Reinforcement Learning Environments framework for collecting and evaluating LLM trajectories through diverse … ☆851 Updated this week
- Public repository containing METR's DVC pipeline for eval data analysis ☆199 Updated last week
- This repository contains the toolkit for replicating results from our technical report. ☆200 Updated 5 months ago
- Red-Teaming Language Models with DSPy ☆250 Updated 11 months ago
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models ☆348 Updated 6 months ago
- Super basic implementation (gist-like) of RLMs with REPL environments. ☆592 Updated last month
- METR Task Standard ☆173 Updated last year
- Inference API for many LLMs and other useful tools for empirical research ☆104 Updated this week
- ☆118 Updated 3 weeks ago
- A Text-Based Environment for Interactive Debugging ☆293 Updated last week
- ☆483 Updated 6 months ago
- [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents ☆538 Updated this week
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training". ☆124 Updated last year
- A toolkit for describing model features and intervening on those features to steer behavior. ☆228 Updated last month
- Collection of scripts and notebooks for OpenAI's latest GPT OSS models ☆496 Updated 5 months ago