rgreenblatt / control-evaluations
☆20 · Updated last year
Alternatives and similar repositories for control-evaluations
Users interested in control-evaluations are comparing it to the repositories listed below.
- ControlArena is a collection of settings, model organisms and protocols for running control experiments. ☆129 · Updated this week
- Improving Steering Vectors by Targeting Sparse Autoencoder Features ☆24 · Updated last year
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training". ☆122 · Updated last year
- Steering vectors for transformer language models in Pytorch / Huggingface ☆130 · Updated 9 months ago
- ☆17 · Updated last year
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" ☆122 · Updated last year
- Improving Alignment and Robustness with Circuit Breakers ☆244 · Updated last year
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆65 · Updated 5 months ago
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m… ☆155 · Updated 6 months ago
- ☆31 · Updated 2 years ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆145 · Updated 5 months ago
- Tools for optimizing steering vectors in LLMs. ☆15 · Updated 7 months ago
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆28 · Updated last year
- METR Task Standard ☆168 · Updated 9 months ago
- ☆75 · Updated 2 years ago
- OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents [NeurIPS 2025 Spotlight] ☆39 · Updated 2 months ago
- Code to break Llama Guard ☆32 · Updated last year
- Open source replication of Anthropic's Crosscoders for Model Diffing ☆62 · Updated last year
- ☆81 · Updated last month
- ☆142 · Updated 4 months ago
- ☆59 · Updated 2 years ago
- Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆116 · Updated 9 months ago
- ☆228 · Updated last month
- bloom - evaluate any behavior immediately 🌸🌱 ☆27 · Updated last week
- Code repo for the model organisms and convergent directions of EM papers. ☆40 · Updated 2 months ago
- ☆196 · Updated last month
- ⚓️ Repository for the "Thought Anchors: Which LLM Reasoning Steps Matter?" paper. ☆92 · Updated last month
- Collection of evals for Inspect AI ☆290 · Updated this week
- ☆32 · Updated 9 months ago
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models ☆287 · Updated 4 months ago