tonychenxyz / selfie

This repository contains the code and data for the paper "SelfIE: Self-Interpretation of Large Language Model Embeddings" by Haozhe Chen, Carl Vondrick, and Chengzhi Mao.

☆53

Alternatives and similar repositories for selfie

Users who are interested in selfie are comparing it to the repositories listed below.

ericwtodd / function_vectors
Function Vectors in Large Language Models (ICLR 2024)
☆184 Updated 7 months ago
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆141 Updated 4 months ago
zjunlp / KnowledgeCircuits
[NeurIPS 2024] Knowledge Circuits in Pretrained Transformers
☆159 Updated this week
ucl-dark / llm_debate
Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"
☆121 Updated last year
Luckfort / CD
[COLING'25] Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?
☆82 Updated 9 months ago
Glaciohound / LM-Steer
Official Code Repository for LM-Steer Paper: "Word Embeddings Are Steers for Language Models" (ACL 2024 Outstanding Paper Award)
☆129 Updated 4 months ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆193 Updated last year
technion-cs-nlp / LLMsKnow
☆82 Updated 9 months ago
shengliu66 / ICV
Code for In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
☆192 Updated 9 months ago
peterljq / Parsimonious-Concept-Engineering
PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024)
☆40 Updated last year
GraySwanAI / circuit-breakers
Improving Alignment and Robustness with Circuit Breakers
☆242 Updated last year
OpenMOSS / Language-Model-SAEs
Performant framework for training, analyzing and visualizing Sparse Autoencoders (SAEs) and their frontier variants.
☆163 Updated this week
tatsu-lab / test_set_contamination
☆41 Updated 2 years ago
da03 / Internalize_CoT_Step_by_Step
☆197 Updated 7 months ago
MikaStars39 / FeatureAlignment
FeatureAlignment = Alignment + Mechanistic Interpretability
☆31 Updated 8 months ago
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆98 Updated 2 years ago
MingLiiii / Layer_Gradient
[ACL'25 Oral] What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective
☆74 Updated 4 months ago
alisawuffles / proxy-tuning
Code associated with Tuning Language Models by Proxy (Liu et al., 2024)
☆121 Updated last year
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆84 Updated 8 months ago
yuzhaouoe / SAE-based-representation-engineering
[NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
☆67 Updated 11 months ago
IBM / sae-steering
Code to enable layer-level steering in LLMs using sparse auto encoders
☆28 Updated 2 months ago
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆119 Updated 2 months ago
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆116 Updated 8 months ago
QingruZhang / PASTA
PASTA: Post-hoc Attention Steering for LLMs
☆127 Updated 11 months ago
montemac / activation_additions
Algebraic value editing in pretrained language models
☆66 Updated 2 years ago
hkust-nlp / Activation_Decoding
In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024)
☆62 Updated last year
saprmarks / geometry-of-truth
☆94 Updated last year
sail-sg / Agent-Smith
[ICML 2024] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
☆115 Updated last year
javiferran / sae_entities
☆63 Updated 8 months ago
chrisliu298 / awesome-representation-engineering
A resource repository for representation engineering in large language models
☆140 Updated last year