FOR-sight-ai / interpreto
Interpreto is an interpretability toolbox for LLMs
☆71 · Updated last week
Alternatives and similar repositories for interpreto
Users who are interested in interpreto are comparing it to the libraries listed below.
- Overcomplete is a Vision-based SAE Toolbox ☆106 · Updated last week
- Unified access to Large Language Model modules using NNsight ☆68 · Updated 3 weeks ago
- Attribution-based Parameter Decomposition ☆33 · Updated 6 months ago
- ☆36 · Updated last year
- Sparse Autoencoder Training Library ☆55 · Updated 7 months ago
- Steering vectors for transformer language models in Pytorch / Huggingface ☆130 · Updated 9 months ago
- Modified to support crosscoder training. ☆24 · Updated 2 months ago
- Repository for PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits, accepted at CVPR 2024 XAI4CV Works… ☆19 · Updated last year
- ☆83 · Updated 9 months ago
- ☆144 · Updated 3 months ago
- Tools for studying developmental interpretability in neural networks. ☆117 · Updated 5 months ago
- Tools for optimizing steering vectors in LLMs. ☆15 · Updated 8 months ago
- Engine for collecting, uploading, and downloading model activations ☆24 · Updated 8 months ago
- Universal Neurons in GPT2 Language Models ☆31 · Updated last year
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆232 · Updated 11 months ago
- Official codebase for "Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions" (Matrenok … ☆27 · Updated this week
- Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models … ☆231 · Updated last week
- A toolkit that provides a range of model diffing techniques including a UI to visualize them interactively. ☆43 · Updated last week
- Influenciae is a Tensorflow Toolbox for Influence Functions ☆64 · Updated last year
- ControlArena is a collection of settings, model organisms and protocols - for running control experiments. ☆134 · Updated this week
- Open source replication of Anthropic's Crosscoders for Model Diffing ☆63 · Updated last year
- Mechanistic Interpretability Visualizations using React ☆302 · Updated 11 months ago
- ☆259 · Updated last year
- ☆20 · Updated last month
- Erasing concepts from neural representations with provable guarantees ☆239 · Updated 10 months ago
- Flexible library for merging large language models (LLMs) via evolutionary optimization (ACL 2025 Demo). ☆93 · Updated 4 months ago
- ☆111 · Updated 10 months ago
- A library for efficient patching and automatic circuit discovery. ☆80 · Updated 4 months ago
- ☆136 · Updated 3 weeks ago
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆28 · Updated last year