emergent-misalignment / emergent-misalignment
☆137Updated last month
Alternatives and similar repositories for emergent-misalignment:
Users who are interested in emergent-misalignment are comparing it to the libraries listed below.
- METR Task Standard☆146Updated 2 months ago
- Collection of evals for Inspect AI☆115Updated this week
- Vivaria is METR's tool for running evaluations and conducting agent elicitation research.☆90Updated this week
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".☆208Updated 6 months ago
- Improving Alignment and Robustness with Circuit Breakers☆197Updated 7 months ago
- A toolkit for describing model features and intervening on those features to steer behavior.☆178Updated 5 months ago
- Steering vectors for transformer language models in Pytorch / Huggingface☆95Updated 2 months ago
- Open source interpretability artefacts for R1.☆82Updated this week
- A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning☆139Updated this week
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".☆102Updated last year
- Functional Benchmarks and the Reasoning Gap☆85Updated 6 months ago
- Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …☆169Updated this week
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods☆70Updated 3 weeks ago
- Red-Teaming Language Models with DSPy☆183Updated 2 months ago
- Verdict is a library for scaling judge-time compute.☆199Updated last week
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).☆197Updated 4 months ago
- WMDP is an LLM proxy benchmark for hazardous knowledge in bio, cyber, and chemical security. We also release code for RMU, an unlearning m…☆111Updated last year
- open source interpretability platform 🧠☆95Updated this week
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization'☆187Updated 4 months ago
- PyTorch library for Active Fine-Tuning☆64Updated 2 months ago
- ⚖️ Awesome LLM Judges ⚖️☆93Updated 2 months ago
- Measuring the situational awareness of language models☆34Updated last year
- https://transformer-circuits.pub/2025/attribution-graphs/methods.html☆42Updated 3 weeks ago
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces☆91Updated last year
- Repository for the paper Stream of Search: Learning to Search in Language☆145Updated 2 months ago