aogara-ds / hoodwinked-website
A text-based game where language models learn to lie and to detect lies.
☆11 Updated last year
Related projects
Alternatives and complementary repositories for hoodwinked-website
- Algebraic value editing in pretrained language models ☆57 Updated last year
- ☆188 Updated last month
- A library for efficient patching and automatic circuit discovery. ☆31 Updated last month
- Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery" ☆26 Updated 5 months ago
- ☆105 Updated last month
- ☆98 Updated 3 months ago
- NeuroSurgeon is a package that enables researchers to uncover and manipulate subnetworks within models in Huggingface Transformers ☆36 Updated 3 months ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆57 Updated 2 weeks ago
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆78 Updated last year
- Steering Llama 2 with Contrastive Activation Addition ☆98 Updated 5 months ago
- Experiments with representation engineering ☆10 Updated 8 months ago
- Sparse probing paper full code. ☆51 Updated 11 months ago
- ☆29 Updated 7 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆154 Updated last month
- Röttger et al. (2023): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆63 Updated 10 months ago
- ☆58 Updated last year
- ☆49 Updated last year
- Code and Data Repo for the CoNLL Paper -- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State ☆17 Updated 10 months ago
- Sparse Autoencoder Training Library ☆27 Updated 3 weeks ago
- AI Logging for Interpretability and Explainability🔬 ☆89 Updated 5 months ago
- Measuring the situational awareness of language models ☆33 Updated 9 months ago
- ☆28 Updated last month
- (Model-written) LLM evals library ☆16 Updated 3 months ago
- ☆76 Updated 9 months ago
- ☆44 Updated this week
- Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning [ICML 2024] ☆15 Updated 6 months ago
- Finding trojans in aligned LLMs. Official repository for the competition hosted at SaTML 2024. ☆107 Updated 5 months ago
- ☆107 Updated this week
- ☆13 Updated 2 months ago
- ☆108 Updated last year