maxdreyer / PURE
Repository for PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits, accepted at the CVPR 2024 XAI4CV Workshop (spotlight)
☆19 Updated last year
Alternatives and similar repositories for PURE
Users who are interested in PURE are comparing it to the libraries listed below.
- ☆48 Updated last year
- Sparse Autoencoder Training Library ☆54 Updated 5 months ago
- NeuroSurgeon is a package that enables researchers to uncover and manipulate subnetworks within models in Huggingface Transformers ☆41 Updated 7 months ago
- ☆34 Updated 2 years ago
- ☆125 Updated 2 weeks ago
- 👋 Overcomplete is a Vision-based SAE Toolbox ☆90 Updated 2 months ago
- ☆187 Updated 2 months ago
- Stanford NLP Python library for benchmarking the utility of LLM interpretability methods ☆136 Updated 3 months ago
- [ICML 24] A novel automated neuron explanation framework that can accurately describe poly-semantic concepts in deep neural networks ☆13 Updated 5 months ago
- Steering vectors for transformer language models in Pytorch / Huggingface ☆125 Updated 7 months ago
- Improving Steering Vectors by Targeting Sparse Autoencoder Features ☆24 Updated 10 months ago
- ☆54 Updated 10 months ago
- ☆106 Updated 7 months ago
- Official implementation of the paper "RelP: Faithful and Efficient Circuit Discovery via Relevance Patching" ☆15 Updated last month
- Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models … ☆215 Updated last week
- Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery" ☆41 Updated last year
- What do we learn from inverting CLIP models? ☆55 Updated last year
- [ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆61 Updated 3 months ago
- ☆23 Updated last year
- ☆15 Updated 5 months ago
- Tools for optimizing steering vectors in LLMs. ☆11 Updated 5 months ago
- A tiny easily hackable implementation of a feature dashboard. ☆15 Updated 3 weeks ago
- Open source replication of Anthropic's Crosscoders for Model Diffing ☆59 Updated 11 months ago
- Source code of "Task arithmetic in the tangent space: Improved editing of pre-trained models". ☆105 Updated 2 years ago
- ☆43 Updated last year
- Modified to support crosscoder training. ☆23 Updated 2 months ago
- A simple and efficient baseline for data attribution ☆11 Updated last year
- Evaluate interpretability methods on localizing and disentangling concepts in LLMs. ☆56 Updated last year
- Spurious Features Everywhere - Large-Scale Detection of Harmful Spurious Features in ImageNet ☆32 Updated 2 years ago
- A library for efficient patching and automatic circuit discovery. ☆77 Updated 2 months ago