Aleph-Alpha / AtMan
☆29 · Updated last year
Alternatives and similar repositories for AtMan:
Users who are interested in AtMan are comparing it to the libraries listed below.
- Steering Llama 2 with Contrastive Activation Addition ☆120 · Updated 8 months ago
- [NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898 ☆204 · Updated 8 months ago
- Steering vectors for transformer language models in Pytorch / Huggingface ☆83 · Updated 2 months ago
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆26 · Updated 8 months ago
- ☆109 · Updated 5 months ago
- Functional Benchmarks and the Reasoning Gap ☆82 · Updated 3 months ago
- ☆143 · Updated last week
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training". ☆92 · Updated 10 months ago
- ☆140 · Updated this week
- Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions" ☆64 · Updated 7 months ago
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' ☆177 · Updated last month
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ☆165 · Updated 3 months ago
- Repository for the paper Stream of Search: Learning to Search in Language ☆126 · Updated 5 months ago
- The Official Repository for "Bring Your Own Data! Self-Supervised Evaluation for Large Language Models" ☆108 · Updated last year
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆84 · Updated last year
- Code accompanying the paper Pretraining Language Models with Human Preferences ☆180 · Updated 11 months ago
- ☆118 · Updated 2 weeks ago
- Mechanistic Interpretability Visualizations using React ☆224 · Updated last month
- ☆10 · Updated 6 months ago
- Just a bunch of benchmark logs for different LLMs ☆117 · Updated 6 months ago
- For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. ☆88 · Updated this week
- Python Client for the Aleph Alpha API ☆92 · Updated this week
- Code to reproduce "Transformers Can Do Arithmetic with the Right Embeddings", McLeish et al (NeurIPS 2024) ☆183 · Updated 8 months ago
- Replicating O1 inference-time scaling laws ☆73 · Updated last month
- ☆202 · Updated 3 months ago
- Measuring the situational awareness of language models ☆33 · Updated 11 months ago
- Learning to Compress Prompts with Gist Tokens - https://arxiv.org/abs/2304.08467 ☆274 · Updated last year
- Improving Alignment and Robustness with Circuit Breakers ☆176 · Updated 4 months ago
- Inspecting and Editing Knowledge Representations in Language Models ☆112 · Updated last year
- ☆54 · Updated 2 months ago