sciai-lab / Truth_is_Universal

☆25

Alternatives and similar repositories for Truth_is_Universal

Users who are interested in Truth_is_Universal compare it to the repositories listed below.

balevinstein / Probes
☆51 Updated last year
yuzhaouoe / SAE-based-representation-engineering
[NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
☆60 Updated 7 months ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆158 Updated last year
hannamw / EAP-IG
☆37 Updated last month
javiferran / sae_entities
☆44 Updated 3 months ago
saprmarks / geometry-of-truth
☆85 Updated 10 months ago
IBM / activation-steering
General-purpose activation steering library
☆78 Updated last month
adamkarvonen / SAEBench
☆101 Updated 3 weeks ago
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆72 Updated 3 months ago
Aaquib111 / edge-attribution-patching
Code for my NeurIPS 2024 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery"
☆35 Updated last year
explanare / ravel
Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
☆47 Updated 8 months ago
UFO-101 / auto-circuit
A library for efficient patching and automatic circuit discovery.
☆67 Updated 2 months ago
Varal7 / conformal-language-modeling
Conformal Language Modeling
☆30 Updated last year
MaheepChaudhary / SAE-Ravel
Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…
☆11 Updated 5 months ago
milesaturpin / cot-unfaithfulness
☆44 Updated last year
CEBaBing / CEBaB
CEBaB: Estimating the Causal Effects of Real-World Concepts on NLP Model Behavior
☆12 Updated 2 years ago
wesg52 / sparse-probing-paper
Sparse probing paper full code.
☆58 Updated last year
chrisliu298 / awesome-representation-engineering
A resource repository for representation engineering in large language models
☆126 Updated 7 months ago
fc2869 / lo-fit
LoFiT: Localized Fine-tuning on LLM Representations
☆39 Updated 5 months ago
steering-vectors / steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
☆108 Updated 4 months ago
zepingyu0512 / neuron-attribution
Code for the EMNLP 2024 paper: Neuron-Level Knowledge Attribution in Large Language Models
☆35 Updated 7 months ago
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆94 Updated last year
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆95 Updated 3 weeks ago
interpretingdl / eacl2024_transformer_interpretability_tutorial
Materials for EACL2024 tutorial: Transformer-specific Interpretability
☆55 Updated last year
dannyallover / overthinking_the_truth
☆29 Updated last year
KihoPark / linear_rep_geometry
☆95 Updated 4 months ago
jacobdunefsky / transcoder_circuits
☆131 Updated 7 months ago
peterljq / Parsimonious-Concept-Engineering
PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024)
☆37 Updated 7 months ago
roeehendel / icl_task_vectors
☆95 Updated last year
Thartvigsen / GRACE
[NeurIPS'23] Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors
☆77 Updated 6 months ago