Dakingrai / awesome-mechanistic-interpretability-lm-papers
☆136 Updated 4 months ago
Alternatives and similar repositories for awesome-mechanistic-interpretability-lm-papers:
Users who are interested in awesome-mechanistic-interpretability-lm-papers are comparing it to the libraries listed below
- Using sparse coding to find distributed representations used by neural networks. ☆225 Updated last year
- Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models … ☆165 Updated this week
- For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. ☆105 Updated last week
- Steering Llama 2 with Contrastive Activation Addition ☆131 Updated 10 months ago
- This repository collects all relevant resources about interpretability in LLMs ☆327 Updated 4 months ago
- ☆78 Updated last week
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆91 Updated last year
- ☆213 Updated 5 months ago
- ☆151 Updated this week
- Steering vectors for transformer language models in PyTorch / Hugging Face ☆91 Updated last month
- Function Vectors in Large Language Models (ICLR 2024) ☆151 Updated last week
- ☆90 Updated last month
- ☆82 Updated 7 months ago
- A resource repository for representation engineering in large language models ☆115 Updated 4 months ago
- ☆66 Updated 4 months ago
- AI Logging for Interpretability and Explainability🔬 ☆108 Updated 9 months ago
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. ☆69 Updated 3 weeks ago
- ☆93 Updated last year
- LoFiT: Localized Fine-tuning on LLM Representations ☆34 Updated 2 months ago
- ☆113 Updated 7 months ago
- [NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering ☆52 Updated 4 months ago
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆190 Updated 3 months ago
- A curated list of LLM interpretability-related material: tutorials, libraries, surveys, papers, blogs, etc. ☆208 Updated last week
- An Open Source Implementation of Anthropic's Paper: "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning" ☆41 Updated 10 months ago
- General-purpose activation steering library ☆52 Updated 2 months ago
- ☆196 Updated last year
- ☆23 Updated last month
- Sparse probing paper full code. ☆55 Updated last year
- Mechanistic Interpretability Visualizations using React ☆235 Updated 3 months ago
- ☆258 Updated last month