stanfordnlp / pyvene

Stanford NLP Python library for understanding and improving PyTorch models via interventions

☆793

Alternatives and similar repositories for pyvene

Users who are interested in pyvene are comparing it to the libraries listed below.

AlignmentResearch / tuned-lens
Tools for understanding how transformer predictions are built layer-by-layer
☆514 Updated 2 weeks ago
jbloomAus / SAELens
Training Sparse Autoencoders on Language Models
☆919 Updated this week
EleutherAI / sparsify
Sparsify transformers with SAEs and transcoders
☆608 Updated this week
ndif-team / nnsight
The nnsight package enables interpreting and manipulating the internals of deep learned models.
☆630 Updated this week
ruizheliUOA / Awesome-Interpretability-in-Large-Language-Models
This repository collects all relevant resources about interpretability in LLMs
☆368 Updated 9 months ago
TransformerLensOrg / CircuitsVis
Mechanistic Interpretability Visualizations using React
☆280 Updated 8 months ago
openai / sparse_autoencoder
☆508 Updated last year
HoagyC / sparse_coding
Using sparse coding to find distributed representations used by neural networks.
☆262 Updated last year
andyzoujm / representation-engineering
Representation Engineering: A Top-Down Approach to AI Transparency
☆865 Updated last year
ai-safety-foundation / sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
☆258 Updated last year
saprmarks / dictionary_learning
☆327 Updated this week
callummcdougall / sae_vis
Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research).
☆211 Updated 8 months ago
ArthurConmy / Automatic-Circuit-Discovery
☆237 Updated 10 months ago
jacobdunefsky / transcoder_circuits
☆162 Updated 9 months ago
kmeng01 / rome
Locating and editing factual associations in GPT (NeurIPS 2022)
☆657 Updated last year
inseq-team / inseq
Interpretability for sequence generation models 🐛 🔍
☆433 Updated 3 months ago
kmeng01 / memit
Mass-editing thousands of facts into a transformer memory (ICLR 2023)
☆510 Updated last year
davidbau / baukit
☆223 Updated last year
collin-burns / discovering_latent_knowledge
☆276 Updated last year
Dakingrai / awesome-mechanistic-interpretability-lm-papers
☆185 Updated 9 months ago
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆206 Updated this week
adamkarvonen / SAEBench
☆111 Updated last month
likenneth / honest_llama
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
☆542 Updated 6 months ago
callummcdougall / ARENA_2.0
Resources for skilling up in AI alignment research engineering. Covers basics of deep learning, mechanistic interpretability, and RL.
☆221 Updated last week
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆174 Updated last year
andyrdt / refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆257 Updated 2 months ago
TransformerLensOrg / TransformerLens
A library for mechanistic interpretability of GPT-style language models
☆2,479 Updated last week
IINemo / lm-polygraph
☆331 Updated this week
saprmarks / feature-circuits
☆184 Updated last month
redwoodresearch / Easy-Transformer
☆122 Updated last year