EleutherAI / concept-erasure
Erasing concepts from neural representations with provable guarantees
☆221 · Updated this week
Alternatives and similar repositories for concept-erasure:
Users interested in concept-erasure are comparing it to the libraries listed below.
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆177 · Updated last month
- Steering vectors for transformer language models in PyTorch / Hugging Face ☆81 · Updated 2 months ago
- Mechanistic Interpretability Visualizations using React ☆223 · Updated last month
- Sparse autoencoders ☆414 · Updated last week
- Extract full next-token probabilities via language model APIs ☆229 · Updated 11 months ago
- Keeping language models honest by directly eliciting knowledge encoded in their activations. ☆193 · Updated this week
- Code to reproduce "Transformers Can Do Arithmetic with the Right Embeddings", McLeish et al. (NeurIPS 2024) ☆183 · Updated 8 months ago
- TART: A plug-and-play Transformer module for task-agnostic reasoning ☆194 · Updated last year
- Utilities for the Hugging Face transformers library ☆64 · Updated 2 years ago
- git extension for {collaborative, communal, continual} model development ☆207 · Updated 2 months ago
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆91 · Updated 2 months ago
- Tools for understanding how transformer predictions are built layer-by-layer ☆461 · Updated 7 months ago
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ☆163 · Updated 3 months ago
- Steering Llama 2 with Contrastive Activation Addition ☆119 · Updated 8 months ago
- A library for bridging Python and HTML/JavaScript (via Svelte) for creating interactive visualizations ☆14 · Updated 9 months ago
- Understand and test language model architectures on synthetic tasks. ☆177 · Updated 2 weeks ago
- Emergent world representations: Exploring a sequence model trained on a synthetic task ☆174 · Updated last year
- 🧠 Starter templates for doing interpretability research ☆65 · Updated last year
- Training Sparse Autoencoders on Language Models ☆594 · Updated this week
- Using sparse coding to find distributed representations used by neural networks. ☆210 · Updated last year