javiferran / sae_entities

☆63

Alternatives and similar repositories for sae_entities

Users who are interested in sae_entities are comparing it to the repositories listed below.

yuzhaouoe / SAE-based-representation-engineering
[NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
☆66 Updated 11 months ago
licong-lin / negative-preference-optimization
☆66 Updated last year
jinzhuoran / RWKU
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024
☆83 Updated last year
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆82 Updated 7 months ago
zepingyu0512 / neuron-attribution
code for EMNLP 2024 paper: Neuron-Level Knowledge Attribution in Large Language Models
☆45 Updated 11 months ago
fc2869 / lo-fit
LoFiT: Localized Fine-tuning on LLM Representations
☆42 Updated 9 months ago
hkust-nlp / Activation_Decoding
In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024)
☆61 Updated last year
boyiwei / alignment-attribution-code
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
☆85 Updated 6 months ago
yaojin17 / Unlearning_LLM
[ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models"
☆60 Updated last year
MikaStars39 / FeatureAlignment
FeatureAlignment = Alignment + Mechanistic Interpretability
☆31 Updated 7 months ago
OPTML-Group / Unlearn-Simple
[NeurIPS25] Official repo for "Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning"
☆33 Updated 3 weeks ago
yihuaihong / ConceptVectors
[EMNLP 2025 Main] ConceptVectors Benchmark and Code for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces"
☆35 Updated 2 months ago
princeton-nlp / benign-data-breaks-safety
☆41 Updated last year
ericwtodd / function_vectors
Function Vectors in Large Language Models (ICLR 2024)
☆181 Updated 6 months ago
VITA-Group / SEAL
Official code for SEAL: Steerable Reasoning Calibration of Large Language Models for Free
☆44 Updated 6 months ago
ykwon0407 / DataInf
DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024)
☆76 Updated last year
dannyallover / overthinking_the_truth
☆29 Updated last year
alisawuffles / proxy-tuning
Code associated with Tuning Language Models by Proxy (Liu et al., 2024)
☆121 Updated last year
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆116 Updated 8 months ago
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆111 Updated last month
shuyhere / Awesome-Sparse-Autoencoder
Collection of Reverse Engineering in Large Model
☆34 Updated 9 months ago
DAMO-NLP-SG / multilingual_analysis
[NeurIPS 2024] How do Large Language Models Handle Multilingualism?
☆42 Updated 11 months ago
ZFancy / awesome-activation-engineering
A curated list of resources for activation engineering
☆107 Updated 3 weeks ago
chrisliu298 / awesome-representation-engineering
A resource repository for representation engineering in large language models
☆138 Updated 11 months ago
jaechan-repo / muse_bench
☆28 Updated last year
CaoYuanpu / BiPO
Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization
☆33 Updated last year
deeplearning-wisc / haloscope
source code for NeurIPS'24 paper "HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection"
☆58 Updated 6 months ago
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆136 Updated 4 months ago
swj0419 / muse_bench
☆28 Updated 7 months ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆188 Updated last year