rgreenblatt / control-evaluations
☆15Updated last year
Alternatives and similar repositories for control-evaluations
Users who are interested in control-evaluations are comparing it to the libraries listed below.
- ControlArena is a collection of settings, model organisms and protocols - for running control experiments.☆76Updated this week
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".☆109Updated last year
- Collection of evals for Inspect AI☆186Updated this week
- ☆72Updated 2 years ago
- METR Task Standard☆154Updated 5 months ago
- Improving Steering Vectors by Targeting Sparse Autoencoder Features☆24Updated 8 months ago
- ☆69Updated last month
- ☆171Updated 4 months ago
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"☆59Updated last month
- ☆231Updated 9 months ago
- Steering vectors for transformer language models in Pytorch / Huggingface☆117Updated 5 months ago
- ☆92Updated 2 months ago
- ☆11Updated last year
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"☆112Updated last year
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024.☆113Updated last year
- Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL.☆217Updated last year
- Inference API for many LLMs and other useful tools for empirical research☆52Updated last week
- ☆137Updated 8 months ago
- This repository collects all relevant resources about interpretability in LLMs☆365Updated 8 months ago
- ☆55Updated 9 months ago
- ☆181Updated last week
- Mechanistic Interpretability Visualizations using React☆265Updated 7 months ago
- ☆45Updated 11 months ago
- ☆99Updated 4 months ago
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces☆95Updated last year
- ☆16Updated 3 weeks ago
- Formal Contracts for Multi-Agent Reinforcement Learning☆19Updated last year
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".☆243Updated last month
- Repository for PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits, accepted at CVPR 2024 XAI4CV Works…☆16Updated last year
- ☆148Updated 8 months ago