davidbau / baukitLinks

☆220

Alternatives and similar repositories for baukit

Users that are interested in baukit are comparing it to the libraries listed below

Sorting:

nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆167Updated last year
jacobdunefsky / transcoder_circuits
☆154Updated 8 months ago
ericwtodd / function_vectors
Function Vectors in Large Language Models (ICLR 2024)
☆172Updated 3 months ago
IBM / activation-steering
General-purpose activation steering library
☆85Updated 2 months ago
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆96Updated last year
montemac / activation_additions
Algebraic value editing in pretrained language models
☆65Updated last year
HoagyC / sparse_coding
Using sparse coding to find distributed representations used by neural networks.
☆261Updated last year
OpenMOSS / Language-Model-SAEs
For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.
☆141Updated last week
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆74Updated 4 months ago
ArthurConmy / Automatic-Circuit-Discovery
☆233Updated 10 months ago
chrisliu298 / awesome-representation-engineering
A resource repository for representation engineering in large language models
☆129Updated 8 months ago
roeehendel / icl_task_vectors
☆96Updated last year
redwoodresearch / Easy-Transformer
☆121Updated 11 months ago
saprmarks / feature-circuits
☆183Updated 2 weeks ago
wesg52 / sparse-probing-paper
Sparse probing paper full code.
☆58Updated last year
KihoPark / linear_rep_geometry
☆103Updated 5 months ago
logix-project / logix
AI Logging for Interpretability and Explainability🔬
☆124Updated last year
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆102Updated 5 months ago
adamkarvonen / SAEBench
☆107Updated 2 weeks ago
saprmarks / geometry-of-truth
☆87Updated 11 months ago
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆73Updated last week
hannamw / EAP-IG
☆45Updated last week
EleutherAI / delphi
Delphi was the home of a temple to Phoebus Apollo, which famously had the inscription, 'Know Thyself.' This library lets language models …
☆200Updated this week
likenneth / honest_llama
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
☆539Updated 6 months ago
andyrdt / refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆246Updated last month
AlignmentResearch / tuned-lens
Tools for understanding how transformer predictions are built layer-by-layer
☆512Updated last year
steering-vectors / steering-vectors
Steering vectors for transformer language models in Pytorch / Huggingface
☆119Updated 5 months ago
evandez / REMEDI
Inspecting and Editing Knowledge Representations in Language Models
☆116Updated 2 years ago
Dakingrai / awesome-mechanistic-interpretability-lm-papers
☆177Updated 8 months ago
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆222Updated 10 months ago