PAIR-code / interpretability
PAIR.withgoogle.com and friend's work on interpretability methods
☆220 Updated this week
Alternatives and similar repositories for interpretability
Users who are interested in interpretability are comparing it to the libraries listed below.
- Utilities for the HuggingFace transformers library ☆74 Updated 3 years ago
- ☆132 Updated 2 years ago
- Mechanistic Interpretability Visualizations using React ☆318 Updated last year
- Landing page for MIB: A Mechanistic Interpretability Benchmark ☆24 Updated 5 months ago
- ☆112 Updated 11 months ago
- ☆267 Updated last year
- Repository for research in the field of Responsible NLP at Meta. ☆205 Updated this week
- datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆102 Updated 2 years ago
- ☆137 Updated last year
- Evaluate interpretability methods on localizing and disentangling concepts in LLMs. ☆57 Updated 3 months ago
- Erasing concepts from neural representations with provable guarantees ☆243 Updated last year
- Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models … ☆241 Updated last week
- Steering vectors for transformer language models in Pytorch / Huggingface ☆140 Updated 11 months ago
- A Python library that encapsulates various methods for neuron interpretation and analysis in Deep NLP models. ☆106 Updated 2 years ago
- ☆83 Updated 11 months ago
- Materials for EACL2024 tutorial: Transformer-specific Interpretability ☆63 Updated last year
- ☆197 Updated last year
- Tools for understanding how transformer predictions are built layer-by-layer ☆567 Updated 5 months ago
- The official code of LM-Debugger, an interactive tool for inspection and intervention in transformer-based language models. ☆182 Updated 3 years ago
- ☆143 Updated last month
- Steering Llama 2 with Contrastive Activation Addition ☆207 Updated last year
- A library for efficient patching and automatic circuit discovery. ☆88 Updated last month
- Sparse probing paper full code. ☆66 Updated 2 years ago
- ☆99 Updated last year
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆238 Updated last year
- Mechanistic Interpretability for Transformer Models ☆53 Updated 3 years ago
- ☆245 Updated last year
- ☆57 Updated 2 years ago
- Using sparse coding to find distributed representations used by neural networks. ☆293 Updated 2 years ago
- Experiments with representation engineering ☆13 Updated last year