ajyl / dpo_toxic

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.

☆82

Alternatives and similar repositories for dpo_toxic

Users who are interested in dpo_toxic are comparing it to the repositories listed below.

ykwon0407 / DataInf
DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024)
☆76 Updated last year
deeplearning-wisc / args
☆45 Updated last year
IBM / activation-steering
[ICLR 2025] General-purpose activation steering library
☆111 Updated last month
licong-lin / negative-preference-optimization
☆66 Updated last year
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆116 Updated 8 months ago
dannyallover / overthinking_the_truth
☆29 Updated last year
roeehendel / icl_task_vectors
☆98 Updated last year
boyiwei / alignment-attribution-code
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
☆85 Updated 6 months ago
chrisliu298 / awesome-representation-engineering
A resource repository for representation engineering in large language models
☆138 Updated 11 months ago
logix-project / logix
AI Logging for Interpretability and Explainability🔬
☆129 Updated last year
milesaturpin / cot-unfaithfulness
☆48 Updated 2 years ago
yuzhaouoe / SAE-based-representation-engineering
[NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
☆66 Updated 11 months ago
javiferran / sae_entities
☆63 Updated 7 months ago
balevinstein / Probes
☆56 Updated 2 years ago
princeton-nlp / benign-data-breaks-safety
☆41 Updated last year
montemac / activation_additions
Algebraic value editing in pretrained language models
☆66 Updated last year
ericwtodd / function_vectors
Function Vectors in Large Language Models (ICLR 2024)
☆181 Updated 6 months ago
SafeAILab / RAIN
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
☆99 Updated last year
abhishekpanigrahi1996 / Skill-Localization-by-grafting
☆51 Updated last year
fc2869 / lo-fit
LoFiT: Localized Fine-tuning on LLM Representations
☆41 Updated 9 months ago
CaoYuanpu / BiPO
Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization
☆33 Updated last year
explanare / ravel
Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
☆56 Updated last year
nrimsky / CAA
Steering Llama 2 with Contrastive Activation Addition
☆188 Updated last year
Thartvigsen / GRACE
[NeurIPS'23] Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors
☆81 Updated 10 months ago
hannamw / EAP-IG
☆57 Updated 3 months ago
nrimsky / LM-exp
LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces
☆98 Updated 2 years ago
rishub-tamirisa / tamper-resistance
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
☆62 Updated 4 months ago
swj0419 / muse_bench
☆28 Updated 7 months ago
yaojin17 / Unlearning_LLM
[ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models"
☆60 Updated last year
tatsu-lab / linguistic_calibration
Align your LM to express calibrated verbal statements of confidence in its long-form generations.
☆27 Updated last year