ruizheliUOA / Awesome-Interpretability-in-Large-Language-Models
This repository collects all relevant resources about interpretability in LLMs
☆289 Updated 3 weeks ago
Related projects
Alternatives and complementary repositories for Awesome-Interpretability-in-Large-Language-Models
- Using sparse coding to find distributed representations used by neural networks. ☆185 Updated last year
- Sparse autoencoders ☆344 Updated last week
- Mechanistic Interpretability Visualizations using React ☆200 Updated 4 months ago
- ☆105 Updated last month
- Training Sparse Autoencoders on Language Models ☆469 Updated this week
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research). ☆158 Updated last month
- Sparse Autoencoder for Mechanistic Interpretability ☆188 Updated 4 months ago
- ☆107 Updated this week
- ☆188 Updated last month
- ☆146 Updated last month
- ☆331 Updated 4 months ago
- The nnsight package enables interpreting and manipulating the internals of deep learned models. ☆406 Updated this week
- Steering vectors for transformer language models in Pytorch / Huggingface ☆65 Updated last month
- ☆99 Updated 3 months ago
- ☆73 Updated 4 months ago
- Steering Llama 2 with Contrastive Activation Addition ☆98 Updated 6 months ago
- A curated list of LLM Interpretability related material: Tutorial, Library, Survey, Paper, Blog, etc. ☆175 Updated last month
- ☆108 Updated last year
- ☆172 Updated 9 months ago
- Improving Alignment and Robustness with Circuit Breakers ☆154 Updated last month
- LLM experiments done during SERI MATS, focusing on activation steering / interpreting activation spaces ☆78 Updated last year
- A bibliography and survey of the papers surrounding o1 ☆780 Updated last week
- A Survey on Data Selection for Language Models ☆182 Updated last month
- ☆44 Updated this week
- AI Logging for Interpretability and Explainability🔬 ☆89 Updated 5 months ago
- Extract full next-token probabilities via language model APIs ☆229 Updated 9 months ago
- Function Vectors in Large Language Models (ICLR 2024) ☆119 Updated last month
- A resource repository for representation engineering in large language models ☆54 Updated last week
- A toolkit for describing model features and intervening on those features to steer behavior. ☆106 Updated last week
- Codebase for reproducing the experiments of the semantic uncertainty paper (short-phrase and sentence-length experiments). ☆247 Updated 7 months ago