TransluceAI / observatory
A toolkit for describing model features and intervening on those features to steer behavior.
☆172 Updated 5 months ago
Alternatives and similar repositories for observatory:
Users who are interested in observatory are comparing it to the libraries listed below.
- Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models … ☆166 Updated this week
- Mechanistic Interpretability Visualizations using React ☆239 Updated 4 months ago
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ☆206 Updated 6 months ago
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆194 Updated 4 months ago
- Steering vectors for transformer language models in Pytorch / Huggingface ☆94 Updated last month
- ☆91 Updated 5 months ago
- ☆142 Updated 4 months ago
- For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. ☆108 Updated last week
- ☆128 Updated 2 weeks ago
- ☆83 Updated this week
- Code for reproducing our paper "Not All Language Model Features Are Linear" ☆73 Updated 4 months ago
- ☆115 Updated 8 months ago
- ☆217 Updated 6 months ago
- ☆165 Updated last month
- Using sparse coding to find distributed representations used by neural networks. ☆230 Updated last year
- Improving Alignment and Robustness with Circuit Breakers ☆196 Updated 6 months ago
- ☆159 Updated last week
- datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆74 Updated last year
- Open source replication of Anthropic's Crosscoders for Model Diffing ☆49 Updated 5 months ago
- A library for efficient patching and automatic circuit discovery. ☆62 Updated 2 months ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆63 Updated 3 weeks ago
- Repository for the paper Stream of Search: Learning to Search in Language ☆144 Updated 2 months ago
- Sparsify transformers with SAEs and transcoders ☆515 Updated this week
- [NeurIPS 2024] Knowledge Circuits in Pretrained Transformers ☆138 Updated last month
- A simple unified framework for evaluating LLMs ☆210 Updated last week
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆92 Updated last year
- Functional Benchmarks and the Reasoning Gap ☆85 Updated 6 months ago
- open source interpretability platform 🧠 ☆85 Updated this week
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file. ☆167 Updated last month
- Steering Llama 2 with Contrastive Activation Addition ☆137 Updated 10 months ago