yfqiu-nlp / sea-llm

Code for the paper "Spectral Editing of Activations for Large Language Model Alignments"

☆29

Alternatives and similar repositories for sea-llm

Users who are interested in sea-llm are comparing it to the libraries listed below

ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆84Updated 8 months ago
milesaturpin / cot-unfaithfulness
☆51Updated 2 years ago
stanfordnlp / axbench
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods
☆145Updated 5 months ago
roeehendel / icl_task_vectors
☆101Updated 2 years ago
explanare / ravel
Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
☆56Updated 3 weeks ago
yuzhaouoe / SAE-based-representation-engineering
[NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
☆67Updated last year
dannyallover / overthinking_the_truth
☆29Updated last year
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆120Updated 2 months ago
MaheepChaudhary / SAE-Ravel
Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…
☆12Updated 10 months ago
logix-project / logix
AI Logging for Interpretability and Explainability🔬
☆133Updated last year
ykwon0407 / DataInf
DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024)
☆76Updated last year
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆116Updated 9 months ago
abhishekpanigrahi1996 / Skill-Localization-by-grafting
☆51Updated last year
montemac / activation_additions
Algebraic value editing in pretrained language models
☆66Updated 2 years ago
activatedgeek / calibration-tuning
☆52Updated 7 months ago
yihuaihong / ConceptVectors
[EMNLP 2025 Main] ConceptVectors Benchmark and Code for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces"
☆38Updated 3 months ago
ruiqi-zhong / nlparam
Augmenting Statistical Models with Natural Language Parameters
☆29Updated last year
Thartvigsen / GRACE
[NeurIPS'23] Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors
☆82Updated 11 months ago
saprmarks / geometry-of-truth
☆94Updated last year
ericwtodd / function_vectors
Function Vectors in Large Language Models (ICLR 2024)
☆185Updated 7 months ago
hannamw / EAP-IG
☆61Updated 4 months ago
fc2869 / lo-fit
LoFiT: Localized Fine-tuning on LLM Representations
☆45Updated 10 months ago
deeplearning-wisc / args
☆46Updated last year
tatsu-lab / linguistic_calibration
Align your LM to express calibrated verbal statements of confidence in its long-form generations.
☆27Updated last year
declare-lab / resta
Restore safety in fine-tuned language models through task arithmetic
☆29Updated last year
mega002 / ff-layers
The accompanying code for "Transformer Feed-Forward Layers Are Key-Value Memories". Mor Geva, Roei Schuster, Jonathan Berant, and Omer Le…
☆99Updated 4 years ago
peterljq / Parsimonious-Concept-Engineering
PaCE: Parsimonious Concept Engineering for Large Language Models (NeurIPS 2024)
☆40Updated last year
socialfoundations / tttlm
Test-time-training on nearest neighbors for large language models
☆47Updated last year
edenbiran / HoppingTooLate
Exploring the Limitations of Large Language Models on Multi-Hop Queries
☆29Updated 8 months ago
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆194Updated last year