AsaCooperStickland / situational-awareness-evals
Measuring the situational awareness of language models
☆40 · Feb 12, 2024 · Updated 2 years ago
Alternatives and similar repositories for situational-awareness-evals
Users who are interested in situational-awareness-evals are comparing it to the libraries listed below
- ☆16 · Mar 22, 2025 · Updated 10 months ago
- ☆20 · Nov 15, 2024 · Updated last year
- ☆16 · Apr 7, 2025 · Updated 10 months ago
- A quick way to get started with Transformer Lens · ☆14 · Dec 13, 2023 · Updated 2 years ago
- [ICLR 2025] Code&Data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization" · ☆14 · Jun 21, 2024 · Updated last year
- Repository with sample code using Apollo's suggested engineering practices · ☆15 · Dec 16, 2024 · Updated last year
- A TinyStories LM with SAEs and transcoders · ☆14 · Apr 3, 2025 · Updated 10 months ago
- Tools for studying developmental interpretability in neural networks. · ☆126 · Dec 30, 2025 · Updated last month
- ☆17 · Dec 21, 2023 · Updated 2 years ago
- (Model-written) LLM evals library · ☆18 · Dec 13, 2024 · Updated last year
- ☆29 · Nov 9, 2025 · Updated 3 months ago
- Sparse Autoencoder Training Library · ☆56 · May 1, 2025 · Updated 9 months ago
- ☆306 · Nov 17, 2023 · Updated 2 years ago
- Reinforcement Learning Replications is a set of Pytorch implementations of reinforcement learning algorithms. · ☆25 · Dec 15, 2024 · Updated last year
- Repository for "Propagating Knowledge Updates to LMs Through Distillation" (NeurIPS 2023). · ☆26 · Aug 25, 2024 · Updated last year
- Exploring the Limitations of Large Language Models on Multi-Hop Queries · ☆32 · Mar 2, 2025 · Updated 11 months ago
- ☆34 · Feb 20, 2025 · Updated 11 months ago
- A library of techniques for local interpretation of machine learning models · ☆10 · Mar 24, 2023 · Updated 2 years ago
- ☆83 · Oct 8, 2025 · Updated 4 months ago
- Squiggle programming language for intuitive probabilistic estimation features in Python · ☆81 · Jan 23, 2026 · Updated 3 weeks ago
- Situational Awareness Dataset · ☆43 · Dec 14, 2024 · Updated last year
- A dataset of alignment research and code to reproduce it · ☆78 · Jun 22, 2023 · Updated 2 years ago
- datasets from the paper "Towards Understanding Sycophancy in Language Models" · ☆103 · Oct 25, 2023 · Updated 2 years ago
- ☆51 · Oct 23, 2023 · Updated 2 years ago
- ControlArena is a collection of settings, model organisms and protocols - for running control experiments. · ☆153 · Updated this week
- Protect Your Secrets. Forever. Ultra-secure notes powered by blockchain. · ☆13 · Apr 29, 2025 · Updated 9 months ago
- Show Window proxy settings · ☆16 · Oct 19, 2016 · Updated 9 years ago
- ☆13 · Dec 21, 2025 · Updated last month
- my profile readme · ☆14 · Updated this week
- ☆10 · Mar 30, 2023 · Updated 2 years ago
- TOD-Flow: Modeling the Structure of Task-Oriented Dialogues · ☆13 · Feb 7, 2024 · Updated 2 years ago
- A Tree-LSTM-based dependency tree sentiment labeler · ☆15 · May 9, 2019 · Updated 6 years ago
- Linear Relational Embeddings (LREs) and Linear Relational Concepts (LRCs) for LLMs in PyTorch · ☆10 · Aug 7, 2024 · Updated last year
- ☆11 · Oct 24, 2022 · Updated 3 years ago
- Web game, clone of Chrome's dinosaur game. · ☆10 · Jan 5, 2023 · Updated 3 years ago
- Probabilistic inference for models of behaviour · ☆10 · Oct 13, 2025 · Updated 4 months ago
- A Cython library to solve the Bittensor registration POW on CUDA · ☆15 · Aug 15, 2025 · Updated 6 months ago
- ☆38 · Oct 2, 2024 · Updated last year
- [NAACL 2024] Vision language model that reduces hallucinations through self-feedback guided revision. Visualizes attentions on image feat… · ☆47 · Aug 21, 2024 · Updated last year