shehper / sparse-dictionary-learning
An Open Source Implementation of Anthropic's Paper: "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning"
☆36 Updated 8 months ago
Alternatives and similar repositories for sparse-dictionary-learning:
Users who are interested in sparse-dictionary-learning are comparing it to the repositories listed below.
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆62 Updated 2 months ago
- Using sparse coding to find distributed representations used by neural networks. ☆210 Updated last year
- Function Vectors in Large Language Models (ICLR 2024) ☆135 Updated 3 months ago
- ☆86 Updated last year
- Steering Llama 2 with Contrastive Activation Addition ☆119 Updated 8 months ago
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆84 Updated last year
- General-purpose activation steering library ☆42 Updated 3 weeks ago
- DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024) ☆61 Updated 3 months ago
- For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. ☆88 Updated this week
- Collection of Reverse Engineering in Large Model ☆31 Updated 3 weeks ago
- ☆110 Updated 2 months ago
- LoFiT: Localized Fine-tuning on LLM Representations ☆30 Updated 2 weeks ago
- Algebraic value editing in pretrained language models ☆62 Updated last year
- [NAACL'25] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering ☆41 Updated 2 months ago
- A resource repository for representation engineering in large language models ☆98 Updated 2 months ago
- ☆109 Updated 5 months ago
- ☆30 Updated 9 months ago
- ☆75 Updated 5 months ago
- ☆135 Updated this week
- Official Code for Paper: Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications ☆68 Updated 3 months ago
- FeatureAlignment = Alignment + Mechanistic Interpretability ☆26 Updated 2 weeks ago
- Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep ☆67 Updated 6 months ago
- Code for EMNLP 2024 paper: Neuron-Level Knowledge Attribution in Large Language Models ☆26 Updated 2 months ago
- Röttger et al. (2023): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models" ☆79 Updated last year
- Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers". ☆65 Updated 10 months ago
- This repository contains the code and data for the paper "SelfIE: Self-Interpretation of Large Language Model Embeddings" by Haozhe Chen,… ☆44 Updated last month
- Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization ☆16 Updated 6 months ago
- ☆139 Updated this week
- A library for efficient patching and automatic circuit discovery. ☆48 Updated 2 months ago
- ☆52 Updated last year