shehper / sparse-dictionary-learning
An Open Source Implementation of Anthropic's Paper: "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning"
☆29 Updated 6 months ago
Related projects
Alternatives and complementary repositories for sparse-dictionary-learning
- ☆68 Updated 3 months ago
- Function Vectors in Large Language Models (ICLR 2024) ☆118 Updated last month
- Using sparse coding to find distributed representations used by neural networks. ☆181 Updated last year
- [NeurIPS'23] Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors ☆69 Updated 8 months ago
- A resource repository for representation engineering in large language models ☆50 Updated 2 months ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆52 Updated last week
- AI Logging for Interpretability and Explainability🔬 ☆88 Updated 5 months ago
- DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024) ☆52 Updated last month
- ☆102 Updated last month
- ☆44 Updated 10 months ago
- For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. ☆45 Updated this week
- ☆79 Updated last year
- ☆108 Updated last year
- Steering Llama 2 with Contrastive Activation Addition ☆95 Updated 5 months ago
- ☆26 Updated 6 months ago
- Source code of "Task arithmetic in the tangent space: Improved editing of pre-trained models". ☆85 Updated last year
- ☆75 Updated 9 months ago
- ☆70 Updated 3 months ago
- Collection of Reverse Engineering in Large Model ☆28 Updated last week
- In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024) ☆45 Updated 7 months ago
- Implementation of PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024) ☆26 Updated last week
- Algebraic value editing in pretrained language models ☆57 Updated last year
- Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering ☆21 Updated 3 weeks ago
- ☆96 Updated 3 months ago
- ☆75 Updated 4 months ago
- Official Code for Paper: Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications ☆58 Updated last month
- ☆141 Updated 3 weeks ago
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆78 Updated last year
- ☆138 Updated 4 months ago
- The Paper List on Data Contamination for Large Language Models Evaluation. ☆74 Updated this week