rgreenblatt / control-evaluations
☆18Updated last year
Alternatives and similar repositories for control-evaluations
Users who are interested in control-evaluations are comparing it to the libraries listed below.
 - ControlArena is a collection of settings, model organisms and protocols for running control experiments.☆115Updated last week
 - Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".☆120Updated last year
 - Improving Steering Vectors by Targeting Sparse Autoencoder Features☆25Updated 11 months ago
 - Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"☆118Updated last year
 - ☆19Updated last week
 - ☆17Updated 11 months ago
 - ☆114Updated 2 weeks ago
 - ☆61Updated last month
 - METR Task Standard☆163Updated 9 months ago
 - ☆79Updated 3 weeks ago
 - ☆75Updated 2 years ago
 - Code repo for the model organisms and convergent directions of EM papers.☆36Updated last month
 - Code to break Llama Guard☆32Updated last year
 - ☆138Updated 3 months ago
 - ☆32Updated 8 months ago
 - ☆31Updated 2 years ago
 - Stanford NLP Python library for benchmarking the utility of LLM interpretability methods☆136Updated 4 months ago
 - Collection of evals for Inspect AI☆272Updated this week
 - Steering vectors for transformer language models in PyTorch / Hugging Face☆127Updated 8 months ago
 - Code for "Preventing Language Models From Hiding Their Reasoning", which evaluates defenses against LLM steganography.☆24Updated last year
 - Open source replication of Anthropic's Crosscoders for Model Diffing☆59Updated last year
 - Inference API for many LLMs and other useful tools for empirical research☆77Updated last week
 - Improving Alignment and Robustness with Circuit Breakers☆238Updated last year
 - Adversarial Attacks on GPT-4 via Simple Random Search [Dec 2023]☆43Updated last year
 - Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…☆28Updated last year
 - Sparse Autoencoder Training Library☆55Updated 6 months ago
 - A library for efficient patching and automatic circuit discovery.☆78Updated 3 months ago
 - Datasets from the paper "Towards Understanding Sycophancy in Language Models"☆95Updated 2 years ago
 - ⚓️ Repository for the "Thought Anchors: Which LLM Reasoning Steps Matter?" paper.☆88Updated last week
 - Tools for optimizing steering vectors in LLMs.☆14Updated 6 months ago