shacharKZ / Visualizing-the-Information-Flow-of-GPTLinks

☆11

Alternatives and similar repositories for Visualizing-the-Information-Flow-of-GPT

Users that are interested in Visualizing-the-Information-Flow-of-GPT are comparing it to the libraries listed below

Sorting:

aryamanarora / causalgym
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
☆48Updated 10 months ago
saprmarks / geometry-of-truth
☆92Updated last year
shauli-ravfogel / rlace-icml
☆36Updated 3 years ago
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆98Updated 2 years ago
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆136Updated 3 months ago
jmerullo / lm_vector_arithmetic
☆36Updated 2 years ago
tatsu-lab / opinions_qa
☆116Updated last year
asaparov / prontoqa
Synthetic question-answering dataset to formally analyze the chain-of-thought output of large language models on a reasoning task.
☆150Updated last month
explanare / ravel
Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
☆56Updated last year
csinva / iprompt
Finding semantically meaningful and accurate prompts.
☆48Updated last year
EleutherAI / elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…
☆28Updated last year
xlang-ai / icl-selective-annotation
[ICLR 2023] Code for our paper "Selective Annotation Makes Language Models Better Few-Shot Learners"
☆111Updated 2 years ago
microsoft / mechanistic-error-probe
A mechanistic approach for understanding and detecting factual errors of large language models.
☆46Updated last year
ruiqi-zhong / DescribeDistributionalDifferences
Code for preprint: Summarizing Differences between Text Distributions with Natural Language
☆43Updated 2 years ago
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆81Updated 7 months ago
logix-project / logix
AI Logging for Interpretability and Explainability🔬
☆129Updated last year
milesaturpin / cot-unfaithfulness
☆48Updated last year
meg-tong / sycophancy-eval
datasets from the paper "Towards Understanding Sycophancy in Language Models"
☆94Updated last year
LoryPack / LLM-LieDetector
Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions"
☆71Updated last year
balevinstein / Probes
☆56Updated 2 years ago
davidbau / baukit
☆233Updated last year
HazyResearch / skill-it
Skill-It! A Data-Driven Skills Framework for Understanding and Training Language Models
☆47Updated last year
zlin7 / UQ-NLG
☆100Updated last year
yuzhaouoe / SAE-based-representation-engineering
[NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
☆64Updated 10 months ago
guy-dar / embedding-space
☆55Updated 2 years ago
causalNLP / cladder
We develop benchmarks and analysis tools to evaluate the causal reasoning abilities of LLMs.
☆128Updated last year
activatedgeek / calibration-tuning
☆52Updated 6 months ago
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆77Updated 2 months ago
google / belief-localization
This repository includes code for the paper "Does Localization Inform Editing? Surprising Differences in Where Knowledge Is Stored vs. Ca…
☆61Updated 2 years ago
feradauto / MoralCoT
Repo for: When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment
☆38Updated 2 years ago