msakarvadia / AttentionLens

Interpreting the latent space representations of attention head outputs for LLMs

☆34

Alternatives and similar repositories for AttentionLens

Users interested in AttentionLens are comparing it to the libraries listed below.

ckkissane / sae-transfer
Code to reproduce key results accompanying "SAEs (usually) Transfer Between Base and Chat Models"
☆12 Updated last year
ckkissane / crosscoder-model-diff-replication
Open source replication of Anthropic's Crosscoders for Model Diffing
☆59 Updated last year
google / belief-localization
This repository includes code for the paper "Does Localization Inform Editing? Surprising Differences in Where Knowledge Is Stored vs. Ca…
☆61 Updated 2 years ago
jiahai-feng / binding-iclr
☆16 Updated last year
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆78 Updated 3 months ago
KoyenaPal / future-lens
Code and Data Repo for the CoNLL Paper -- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
☆20 Updated last week
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆28 Updated last year
Nix07 / finetuning
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity…
☆28 Updated this week
montemac / activation_additions
Algebraic value editing in pretrained language models
☆66 Updated 2 years ago
mega002 / ff-layers
The accompanying code for "Transformer Feed-Forward Layers Are Key-Value Memories". Mor Geva, Roei Schuster, Jonathan Berant, and Omer Le…
☆97 Updated 4 years ago
HazyResearch / skill-it
Skill-It! A Data-Driven Skills Framework for Understanding and Training Language Models
☆47 Updated 2 years ago
ApolloResearch / e2e_sae
Sparse Autoencoder Training Library
☆55 Updated 6 months ago
XiangLi1999 / AutoBencher
☆32 Updated last year
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆136 Updated 4 months ago
abhishekpanigrahi1996 / transformer_in_transformer
☆45 Updated 2 years ago
saprmarks / geometry-of-truth
☆92 Updated last year
haotiansun14 / BBox-Adapter
Lightweight Adapting for Black-Box Large Language Models
☆23 Updated last year
explanare / ravel
Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
☆56 Updated last year
janphilippfranken / sami
Self-Supervised Alignment with Mutual Information
☆21 Updated last year
allenai / hyper-task-descriptions
Learning adapter weights from task descriptions
☆19 Updated last year
MaheepChaudhary / SAE-Ravel
Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…
☆12 Updated 9 months ago
KihoPark / linear_rep_geometry
☆108 Updated 8 months ago
epfml / schedules-and-scaling
Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations"
☆84 Updated last year
probabilistic-inference-scaling / probabilistic-inference-scaling
☆51 Updated 7 months ago
adamkarvonen / SAE_BoardGameEval
☆23 Updated 9 months ago
guy-dar / embedding-space
☆55 Updated 2 years ago
MadryLab / DsDm
☆51 Updated last year
roeehendel / icl_task_vectors
☆98 Updated 2 years ago
tatsu-lab / linguistic_calibration
Align your LM to express calibrated verbal statements of confidence in its long-form generations.
☆27 Updated last year
mlfoundations / scaling
Language models scale reliably with over-training and on downstream tasks
☆100 Updated last year