Dakingrai / awesome-mechanistic-interpretability-lm-papers

☆177

Alternatives and similar repositories for awesome-mechanistic-interpretability-lm-papers

Users who are interested in awesome-mechanistic-interpretability-lm-papers are comparing it to the repositories listed below.

ruizheliUOA / Awesome-Interpretability-in-Large-Language-Models
This repository collects all relevant resources about interpretability in LLMs
☆366 Updated 9 months ago
jacobdunefsky / transcoder_circuits
☆154 Updated 8 months ago
HoagyC / sparse_coding
Using sparse coding to find distributed representations used by neural networks.
☆261 Updated last year
OpenMOSS / Language-Model-SAEs
For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
☆141 Updated last week
redwoodresearch / Easy-Transformer
☆121 Updated last year
ericwtodd / function_vectors
Function Vectors in Large Language Models (ICLR 2024)
☆175 Updated 3 months ago
ArthurConmy / Automatic-Circuit-Discovery
☆233 Updated 10 months ago
saprmarks / feature-circuits
☆183 Updated 3 weeks ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆167 Updated last year
neelnanda-io / Crosscoders
☆50 Updated 8 months ago
cooperleong00 / Awesome-LLM-Interpretability
A curated list of LLM Interpretability related material - Tutorial, Library, Survey, Paper, Blog, etc.
☆262 Updated 4 months ago
davidbau / baukit
☆222 Updated last year
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆200 Updated this week
chrisliu298 / awesome-representation-engineering
A resource repository for representation engineering in large language models
☆129 Updated 8 months ago
hannamw / EAP-IG
☆45 Updated last week
KihoPark / linear_rep_geometry
☆103 Updated 5 months ago
saprmarks / geometry-of-truth
☆89 Updated 11 months ago
roeehendel / icl_task_vectors
☆96 Updated last year
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆257 Updated last year
adamkarvonen / SAEBench
☆107 Updated 2 weeks ago
saprmarks / dictionary_learning
☆324 Updated 3 weeks ago
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆87 Updated last week
ARBORproject / arborproject.github.io
☆81 Updated 5 months ago
ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆57 Updated 9 months ago
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆112 Updated last month
steering-vectors / steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
☆119 Updated 5 months ago
logix-project / logix
AI Logging for Interpretability and Explainability🔬
☆124 Updated last year
javiferran / sae_entities
☆60 Updated 4 months ago
openai / sparse_autoencoder
☆505 Updated last year
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆73 Updated 2 weeks ago