anthropic-experimental / agentic-misalignment
☆520Updated 5 months ago
Alternatives and similar repositories for agentic-misalignment
Users who are interested in agentic-misalignment are comparing it to the libraries listed below.
- Prompts used in the Automated Auditing Blog Post☆123Updated 3 months ago
- ☆226Updated 3 weeks ago
- open source interpretability platform 🧠☆486Updated this week
- An alignment auditing agent capable of quickly exploring alignment hypotheses☆664Updated this week
- Inference-time scaling for LLMs-as-a-judge.☆310Updated 2 weeks ago
- Collection of evals for Inspect AI☆284Updated this week
- Open source interpretability artefacts for R1.☆163Updated 7 months ago
- ☆478Updated 4 months ago
- Testing baseline LLM performance across various models☆317Updated last week
- Real-Time Detection of Hallucinated Entities in Long-Form Generation☆268Updated this week
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research.☆120Updated last week
- Red-Teaming Language Models with DSPy☆235Updated 9 months ago
- [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents☆451Updated last week
- A toolkit for describing model features and intervening on those features to steer behavior.☆214Updated last year
- ⚖️ Awesome LLM Judges ⚖️☆133Updated 6 months ago
- An agent benchmark with tasks in a simulated software company.☆581Updated last month
- Training-Ready RL Environments + Evals☆174Updated this week
- ☆186Updated last week
- ☆174Updated 11 months ago
- This repository contains the toolkit for replicating results from our technical report.☆163Updated 2 months ago
- Provider-agnostic, open-source evaluation infrastructure for language models☆653Updated this week
- Collection of scripts and notebooks for OpenAI's latest GPT OSS models☆477Updated 2 months ago
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models☆284Updated 3 months ago
- Super basic implementation (gist-like) of RLMs with REPL environments.☆248Updated last month
- OSS RL environment + evals toolkit☆200Updated this week
- Public repository containing METR's DVC pipeline for eval data analysis☆129Updated 7 months ago
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".☆122Updated last year
- ☆106Updated last week
- Atropos is a Language Model Reinforcement Learning Environments framework for collecting and evaluating LLM trajectories through diverse …☆749Updated this week
- A subset of jailbreaks automatically discovered by the Haize Labs haizing suite.☆98Updated 7 months ago