anthropic-experimental / agentic-misalignment
☆406 Updated 2 months ago
Alternatives and similar repositories for agentic-misalignment
Users interested in agentic-misalignment are comparing it to the repositories listed below
- ☆198 Updated 5 months ago
- Testing baseline LLMs performance across various models ☆305 Updated 3 weeks ago
- Inference-time scaling for LLMs-as-a-judge. ☆283 Updated last month
- Prompts used in the Automated Auditing Blog Post ☆89 Updated last month
- ☆474 Updated last month
- open source interpretability platform 🧠 ☆356 Updated this week
- Collection of evals for Inspect AI ☆211 Updated this week
- ⚖️ Awesome LLM Judges ⚖️ ☆124 Updated 4 months ago
- Open source interpretability artefacts for R1. ☆158 Updated 4 months ago
- Training teachers with reinforcement learning able to make LLMs learn how to reason for test time scaling. ☆329 Updated 2 months ago
- ☆119 Updated 2 weeks ago
- An agent benchmark with tasks in a simulated software company. ☆534 Updated this week
- ☆133 Updated 5 months ago
- A Tree Search Library with Flexible API for LLM Inference-Time Scaling ☆449 Updated 3 weeks ago
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research. ☆108 Updated last week
- Public repository containing METR's DVC pipeline for eval data analysis ☆104 Updated 4 months ago
- ☆161 Updated 8 months ago
- Scaling Data for SWE-agents ☆378 Updated this week
- Releases from OpenAI Preparedness ☆846 Updated this week
- Frontier Models playing the board game Diplomacy. ☆565 Updated last week
- Atropos is a Language Model Reinforcement Learning Environments framework for collecting and evaluating LLM trajectories through diverse … ☆591 Updated last week
- Collection of scripts and notebooks for OpenAI's latest GPT OSS models ☆398 Updated 2 weeks ago
- Red-Teaming Language Models with DSPy ☆212 Updated 6 months ago
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models ☆200 Updated 3 weeks ago
- Sandboxed code execution for AI agents, locally or on the cloud. Massively parallel, easy to extend. Powering SWE-agent and more. ☆296 Updated this week
- The State Of The Art, intelligence ☆151 Updated 2 weeks ago
- Letting Claude Code develop his own MCP tools :) ☆122 Updated 5 months ago
- This repository contains the toolkit for replicating results from our technical report. ☆103 Updated last month
- A toolkit for describing model features and intervening on those features to steer behavior. ☆198 Updated 9 months ago
- A comprehensive repository of reasoning tasks for LLMs (and beyond) ☆450 Updated 11 months ago