Aleph-Alpha-Research / AtMan
☆28 Updated 3 weeks ago
Alternatives and similar repositories for AtMan
Users who are interested in AtMan are comparing it to the libraries listed below.
- Steering vectors for transformer language models in Pytorch / Huggingface ☆103 Updated 3 months ago
- Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning ☆46 Updated last year
- [ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets ☆216 Updated last year
- Steering Llama 2 with Contrastive Activation Addition ☆155 Updated last year
- Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions" ☆70 Updated 11 months ago
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆93 Updated last year
- Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training". ☆105 Updated last year
- Code accompanying the paper Pretraining Language Models with Human Preferences ☆182 Updated last year
- ☆69 Updated last year
- ☆76 Updated last month
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' ☆193 Updated 6 months ago
- The Official Repository for "Bring Your Own Data! Self-Supervised Evaluation for Large Language Models" ☆108 Updated last year
- Keeping language models honest by directly eliciting knowledge encoded in their activations. ☆205 Updated this week
- ☆96 Updated 3 months ago
- Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models … ☆181 Updated this week
- Evaluating LLMs with fewer examples ☆155 Updated last year
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆200 Updated 5 months ago
- ☆234 Updated 2 months ago
- ☆54 Updated 8 months ago
- ☆274 Updated 11 months ago
- ☆223 Updated 8 months ago
- datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆76 Updated last year
- A library for efficient patching and automatic circuit discovery. ☆65 Updated last month
- Small and Efficient Mathematical Reasoning LLMs ☆71 Updated last year
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" ☆107 Updated last year
- Experiments with representation engineering ☆11 Updated last year
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ☆225 Updated 8 months ago
- The GitHub repo for Goal Driven Discovery of Distributional Differences via Language Descriptions ☆70 Updated 2 years ago
- ☆95 Updated last year
- Mechanistic Interpretability Visualizations using React ☆253 Updated 5 months ago