avalonstrel/Mitigating-the-Alignment-Tax-of-RLHF

☆16

Alternatives and similar repositories for Mitigating-the-Alignment-Tax-of-RLHF

Users that are interested in Mitigating-the-Alignment-Tax-of-RLHF are comparing it to the libraries listed below.

ethz-spylab / jailbreak-tax
☆24 · Feb 17, 2026 · Updated 5 months ago
git-disl / Safety-Tax
This is the official code for the paper "Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable".
☆35 · Mar 11, 2025 · Updated last year
SophieZheng998 / ALI-Agent
Official implementation for "ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation"
☆21 · Jan 31, 2026 · Updated 5 months ago
zjunlp / LookAheadTuning
[WSDM 2026] LookAhead Tuning: Safer Language Models via Partial Answer Previews
☆17 · Dec 14, 2025 · Updated 7 months ago
uclaml / PDE
Official repo of Progressive Data Expansion: data, code and evaluation
☆29 · Nov 16, 2023 · Updated 2 years ago
CryptoAILab / misalignment
[NDSS'25] The official implementation of safety misalignment.
☆19 · Jan 8, 2025 · Updated last year
Jayfeather1024 / Backdoor-Enhanced-Alignment
☆24 · Dec 8, 2024 · Updated last year
git-disl / Booster
This is the official code for the paper "Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturba…
☆41 · Mar 22, 2025 · Updated last year
technion-cs-nlp / irm-for-nli
☆11 · Jun 2, 2022 · Updated 4 years ago
windxrz / DCFR
Source code for KDD 2020 paper "Algorithmic Decision Making with Conditional Fairness".
☆16 · Apr 7, 2026 · Updated 3 months ago
p-lambda / in-n-out
Code for the ICLR 2021 Paper "In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness"
☆13 · Oct 23, 2021 · Updated 4 years ago
ChanLiang / CONNER
[EMNLP 2023] Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators
☆33 · Jan 22, 2024 · Updated 2 years ago
CERT-Lab / fed-sb
(TMLR J2C Certification) Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tu…
☆27 · Oct 4, 2025 · Updated 9 months ago
yale-nlp / InstruSum
☆23 · Feb 26, 2024 · Updated 2 years ago
mstrise / seq2label-crossrep
Sequence Labeling Parsing by Learning Across Representations
☆13 · Oct 3, 2019 · Updated 6 years ago
thestephencasper / benchmarking_interpretability
☆35 · Sep 13, 2023 · Updated 2 years ago
homles11 / SaLoRA
Code for “SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation (ICLR 2025)”
☆29 · Oct 23, 2025 · Updated 9 months ago
facebookresearch / InvarianceUnitTests
Toy datasets to evaluate algorithms for domain generalization and invariance learning.
☆43 · Dec 5, 2021 · Updated 4 years ago
zzp1012 / Cross-Task-Linearity
[ICML 2024] Code release for "On the Emergence of Cross-Task Linearity in Pretraining-Finetuning Paradigm"
☆11 · Feb 20, 2025 · Updated last year
AISafety-HKUST / Backdoor_Safety_Tuning
Backdoor Safety Tuning (NeurIPS 2023 & 2024 Spotlight)
☆27 · Nov 18, 2024 · Updated last year
zhiyuanhubj / Meta-Ability-Alignment
Official code of paper "Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models"
☆88 · May 27, 2025 · Updated last year
aghie / parsing-as-pretraining
Parsing only with Pretraining Networks
☆16 · Jul 25, 2024 · Updated 2 years ago
mstrise / dep2label-bert
Dependency Parsing as Sequence Labeling with BERT
☆13 · Nov 1, 2020 · Updated 5 years ago
frankaging / Causal-Distill
The Codebase for Causal Distillation for Language Models (NAACL '22)
☆26 · May 1, 2022 · Updated 4 years ago
princeton-polaris-lab / Evaluating-Durable-Safeguards
[ICLR 2025] On Evaluating the Durability of Safeguards for Open-Weight LLMs
☆13 · Jun 20, 2025 · Updated last year
rmin2000 / adv_tracing
Identification of the Adversary from a Single Adversarial Example (ICML 2023)
☆10 · Jul 15, 2024 · Updated 2 years ago
cuhksz-nlp / SAPar
☆12 · Dec 23, 2022 · Updated 3 years ago
SCIR-SC-Qiaoban-Team / FreeEvalLM
[AAAI26] Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilitie…
☆11 · Feb 7, 2026 · Updated 5 months ago
alestolfo / causal-math
Code Repository for "A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models".
☆15 · Oct 14, 2022 · Updated 3 years ago
tanganke / subspace_fusion
Code for paper "Concrete Subspace Learning based Interference Elimination for Multi-task Model Fusion"
☆14 · Mar 28, 2024 · Updated 2 years ago
git-disl / Virus
This is the official code for the paper "Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation"
☆56 · Feb 2, 2025 · Updated last year
haoyuzhao123 / LeanIneqComp
An inequality benchmark for theorem proving
☆22 · Feb 1, 2026 · Updated 5 months ago
WeiXiongUST / Decentralized-Proximal-Algorithm-with-Variance-Reduction
This is the code used for the paper "PMGT-VR: A decentralized proximal-gradient algorithmic framework with variance reduction", preprint.
☆15 · Jul 2, 2022 · Updated 4 years ago
r-three / AttriBoT
Code for AttriBoT from "AttriBoT: A Bag of Tricks for Efficiently Approximating Leave-One-Out Context Attribution"
☆15 · Apr 21, 2025 · Updated last year
IBM / NeuralFuse
[NeurIPS'24] "NeuralFuse: Learning to Recover the Accuracy of Access-Limited Neural Network Inference in Low-Voltage Regimes" by Hao-Lun …
☆10 · Sep 18, 2025 · Updated 10 months ago
MaheepChaudhary / SAE-Ravel
Providing the answer to "How to do patching on all available SAEs on GPT-2?". It is an official repository of the implementation of the p…
☆13 · Jan 26, 2025 · Updated last year
karapto / FedBERT
FedBERT: A federated approach that enables clients with limited computing resources to participate without violating data privacy.
☆14 · Jul 3, 2023 · Updated 3 years ago
ColinLu50 / SafeDelta
The official code repo for "Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets" in ICML 2025.
☆59 · Feb 12, 2026 · Updated 5 months ago
linyongver / Bayesian-Invariant-Risk-Minmization
This is the code for the CVPR 2022 paper "Bayesian Invariant Risk Minimization".
☆50 · Jun 25, 2023 · Updated 3 years ago