ruizheliUOA / Awesome-Interpretability-in-Large-Language-Models

This repository collects all relevant resources about interpretability in LLMs

☆385

Alternatives and similar repositories for Awesome-Interpretability-in-Large-Language-Models

Users who are interested in Awesome-Interpretability-in-Large-Language-Models are comparing it to the repositories listed below

Dakingrai / awesome-mechanistic-interpretability-lm-papers
☆221 Updated last year
HoagyC / sparse_coding
Using sparse coding to find distributed representations used by neural networks.
☆287 Updated 2 years ago
jacobdunefsky / transcoder_circuits
☆191 Updated last year
ArthurConmy / Automatic-Circuit-Discovery
☆258 Updated last year
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆284 Updated last year
cooperleong00 / Awesome-LLM-Interpretability
A curated list of LLM interpretability-related material: tutorials, libraries, surveys, papers, blogs, etc.
☆286 Updated 8 months ago
stanfordnlp / pyvene
Stanford NLP Python library for understanding and improving PyTorch models via interventions
☆836 Updated last month
adamkarvonen / SAEBench
☆136 Updated 2 weeks ago
AlignmentResearch / tuned-lens
Tools for understanding how transformer predictions are built layer-by-layer
☆550 Updated 4 months ago
EleutherAI / sparsify
Sparsify transformers with SAEs and transcoders
☆670 Updated this week
saprmarks / feature-circuits
☆196 Updated last month
chrisliu298 / awesome-representation-engineering
A resource repository for representation engineering in large language models
☆142 Updated last year
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆231 Updated this week
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆302 Updated 11 months ago
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆231 Updated 11 months ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆195 Updated last year
saprmarks / dictionary_learning
☆367 Updated 3 months ago
openai / sparse_autoencoder
☆549 Updated last year
ARBORproject / arborproject.github.io
☆83 Updated 9 months ago
steering-vectors / steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
☆130 Updated 9 months ago
redwoodresearch / Easy-Transformer
☆130 Updated last year
interpretingdl / eacl2024_transformer_interpretability_tutorial
Materials for EACL2024 tutorial: Transformer-specific Interpretability
☆61 Updated last year
OpenMOSS / Language-Model-SAEs
Performant framework for training, analyzing and visualizing Sparse Autoencoders (SAEs) and their frontier variants.
☆167 Updated this week
davidbau / baukit
☆238 Updated last year
decoderesearch / SAELens
Training Sparse Autoencoders on Language Models
☆1,093 Updated last week
shehper / sparse-dictionary-learning
An Open Source Implementation of Anthropic's Paper: "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning"
☆49 Updated last year
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆150 Updated 5 months ago
ndif-team / nnsight
The nnsight package enables interpreting and manipulating the internals of deep learned models.
☆716 Updated this week
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆100 Updated 2 years ago
wesg52 / sparse-probing-paper
Full code for the sparse probing paper.
☆65 Updated last year