rgreenblatt / control-evaluations
☆12 · Updated last year
Alternatives and similar repositories for control-evaluations
Users who are interested in control-evaluations are comparing it to the libraries listed below.
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training". ☆108 · Updated last year
- ControlArena is a suite of realistic settings, mimicking complex deployment environments, for running control evaluations. This is an alp… ☆69 · Updated this week
- ☆66 · Updated last month
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" ☆109 · Updated last year
- ☆55 · Updated 9 months ago
- METR Task Standard ☆151 · Updated 4 months ago
- ☆11 · Updated 11 months ago
- Collection of evals for Inspect AI ☆167 · Updated this week
- Code for Preventing Language Models From Hiding Their Reasoning, which evaluates defenses against LLM steganography. ☆22 · Updated last year
- ☆87 · Updated 2 months ago
- ☆134 · Updated 7 months ago
- ☆175 · Updated 2 months ago
- Improving Steering Vectors by Targeting Sparse Autoencoder Features ☆21 · Updated 7 months ago
- Steering vectors for transformer language models in Pytorch / Huggingface ☆110 · Updated 4 months ago
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆59 · Updated 2 weeks ago
- ☆98 · Updated 3 months ago
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆27 · Updated last year
- Formal Contracts for Multi-Agent Reinforcement Learning ☆17 · Updated last year
- Inference API for many LLMs and other useful tools for empirical research ☆49 · Updated last week
- Open source replication of Anthropic's Crosscoders for Model Diffing ☆55 · Updated 8 months ago
- Open source interpretability artefacts for R1. ☆149 · Updated 2 months ago
- A library for efficient patching and automatic circuit discovery. ☆67 · Updated 2 months ago
- ☆22 · Updated 3 weeks ago
- ☆71 · Updated 2 years ago
- ☆27 · Updated 4 months ago
- ☆136 · Updated 7 months ago
- ☆31 · Updated last year
- Steering Llama 2 with Contrastive Activation Addition ☆159 · Updated last year
- ☆16 · Updated last year
- Improving Alignment and Robustness with Circuit Breakers ☆214 · Updated 9 months ago