JoshEngels / SAE-Dark-Matter
Code for our paper "Decomposing The Dark Matter of Sparse Autoencoders"
☆18 · Updated 3 months ago
Alternatives and similar repositories for SAE-Dark-Matter:
Users who are interested in SAE-Dark-Matter are comparing it to the repositories listed below.
- Efficient Dictionary Learning with Switch Sparse Autoencoders (SAEs) ☆20 · Updated 2 months ago
- ☆140 · Updated this week
- Code for reproducing our paper "Not All Language Model Features Are Linear" ☆66 · Updated 2 months ago
- Sparse and discrete interpretability tool for neural networks ☆59 · Updated 11 months ago
- Open source replication of Anthropic's Crosscoders for Model Diffing ☆31 · Updated 3 months ago
- ☆54 · Updated 2 months ago
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆26 · Updated 8 months ago
- [NeurIPS 2024] Goldfish Loss: Mitigating Memorization in Generative LLMs ☆80 · Updated 2 months ago
- ☆75 · Updated 5 months ago
- ☆20 · Updated 3 months ago
- ☆49 · Updated 4 months ago
- This is official project in our paper: Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers ☆28 · Updated last year
- gzip Predicts Data-dependent Scaling Laws ☆33 · Updated 8 months ago
- Minimum Description Length probing for neural network representations ☆18 · Updated this week
- A mechanistic approach for understanding and detecting factual errors of large language models. ☆39 · Updated 6 months ago
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks