git-disl/Safety-Tax

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/git-disl/Safety-Tax)

git-disl / Safety-Tax

This is the official code for the paper "Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable".

☆35

Alternatives and similar repositories for Safety-Tax

Users that are interested in Safety-Tax are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

git-disl / Booster
View on GitHub
This is the official code for the paper "Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturba…
☆40Mar 22, 2025Updated last year
git-disl / Lisa
View on GitHub
This is the official code for the paper "Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning" (NeurIPS2024)
☆28Sep 10, 2024Updated last year
tanganke / subspace_fusion
View on GitHub
Code for paper "Concrete Subspace Learning based Interference Elimination for Multi-task Model Fusion"
☆14Mar 28, 2024Updated 2 years ago
git-disl / Virus
View on GitHub
This is the official code for the paper "Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation"
☆56Feb 2, 2025Updated last year
avalonstrel / Mitigating-the-Alignment-Tax-of-RLHF
View on GitHub
☆16Feb 8, 2024Updated 2 years ago
GPU virtual machines on DigitalOcean Gradient AI • Ad
Get to production fast with high-performance AMD and NVIDIA GPUs you can spin up in seconds. The definition of operational simplicity.
hanshen95 / SEAL
View on GitHub
An implementation of SEAL: Safety-Enhanced Aligned LLM fine-tuning via bilevel data selection.
☆24Feb 20, 2025Updated last year
PKU-YuanGroup / AsFT
View on GitHub
Code for the paper "AsFT: Anchoring Safety During LLM Fune-Tuning Within Narrow Safety Basin".
☆37Jul 10, 2025Updated last year
peijunallin / alphalora
View on GitHub
☆19Nov 10, 2024Updated last year
git-disl / Vaccine
View on GitHub
This is the official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS2024)
☆51Jan 15, 2026Updated 6 months ago
Jayfeather1024 / Backdoor-Enhanced-Alignment
View on GitHub
☆24Dec 8, 2024Updated last year
princeton-polaris-lab / Evaluating-Durable-Safeguards
View on GitHub
[ICLR 2025] On Evluating the Durability of Safegurads for Open-Weight LLMs
☆13Jun 20, 2025Updated last year
IBM / NeuralFuse
View on GitHub
[NeurIPS'24] "NeuralFuse: Learning to Recover the Accuracy of Access-Limited Neural Network Inference in Low-Voltage Regimes" by Hao-Lun …
☆10Sep 18, 2025Updated 10 months ago
git-disl / awesome_LLM-harmful-fine-tuning-papers
View on GitHub
A survey on harmful fine-tuning attack for large language model (ACM CSUR)
☆247Jun 22, 2026Updated 3 weeks ago
ChanLiang / ORIG
View on GitHub
[ACL 2023 findings] Towards Robust Personalized Dialogue Generation via Order-Insensitive Representation Regularization
☆17Aug 26, 2023Updated 2 years ago
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
TDCSZ327 / HTmuon
View on GitHub
☆15May 2, 2026Updated 2 months ago
declare-lab / resta
View on GitHub
Restore safety in fine-tuned language models through task arithmetic
☆33Mar 28, 2024Updated 2 years ago
ys-zong / VLGuard
View on GitHub
[ICML 2024] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models.
☆90Jan 19, 2025Updated last year
Lslland / T-Vaccine
View on GitHub
☆19Jun 21, 2025Updated last year
GraySwanAI / circuit-breakers
View on GitHub
Improving Alignment and Robustness with Circuit Breakers
☆266Sep 24, 2024Updated last year
Egg-Hu / Awesome-Synthetic-Data-Generation
View on GitHub
☆19Jan 7, 2026Updated 6 months ago
WangCheng0116 / Awesome-LRMs-Safety
View on GitHub
Official repository for "Safety in Large Reasoning Models: A Survey" - Exploring safety risks, attacks, and defenses for Large Reasoning …
☆90Aug 25, 2025Updated 10 months ago
IBM / AutoVP
View on GitHub
[ICLR24] "AutoVP: An Automated Visual Prompting Framework and Benchmark" by Hsi-Ai Tsao*, Lei Hsiung*, Pin-Yu Chen, Sijia Liu, and Tsung-…
☆23Sep 18, 2025Updated 10 months ago
LLLeoLi / LARF
View on GitHub
[EMNLP 2025] Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
☆15Jul 22, 2025Updated 11 months ago
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
wicai24 / DOOR-Alignment
View on GitHub
☆20Apr 7, 2025Updated last year
rishub-tamirisa / tamper-resistance
View on GitHub
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
☆68Jun 9, 2025Updated last year
poloclub / llm-landscape
View on GitHub
NeurIPS'24 - LLM Safety Landscape
☆40Oct 21, 2025Updated 8 months ago
ChandlerBang / pytorch-gnn-meta-attack
View on GitHub
Pytorch implementation of gnn meta attack (mettack). Paper title: Adversarial Attacks on Graph Neural Networks via Meta Learning.
☆21Mar 23, 2021Updated 5 years ago
gracefulning / TIDPO
View on GitHub
TOKEN-IMPORTANCE GUIDED DIRECT PREFERENCE OPTIMIZATION
☆38Jan 26, 2026Updated 5 months ago
Unispac / shallow-vs-deep-alignment
View on GitHub
Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep
☆186Apr 23, 2025Updated last year
YefanZhou / TempBalance
View on GitHub
[NeurIPS 2023 Spotlight] Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training
☆37Apr 7, 2025Updated last year
haiquanlu / AlphaPruning
View on GitHub
[NeurIPS 2024] AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models
☆34Jun 9, 2025Updated last year
wizard1203 / FuseFL
View on GitHub
FuseFL: One-Shot Federated Learning through the Lens of Causality with Progressive Model Fusion (NeurIPS 2024 Spotlight)
☆15Mar 31, 2025Updated last year
Wordpress hosting with auto-scaling - Free Trial Offer • Ad
Fully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
decoding-comp-trust / comp-trust
View on GitHub
Codebase for decoding compressed trust.
☆27May 7, 2024Updated 2 years ago
Hritikbansal / jpo
View on GitHub
☆13Jul 2, 2025Updated last year
dsbowen / strong_reject
View on GitHub
☆146Jul 7, 2025Updated last year
tmlr-group / G-effect
View on GitHub
[ICLR 2025] "Rethinking LLM Unlearning Objectives: A Gradient Perspective and Go Beyond"
☆16Feb 27, 2025Updated last year
IBM / composite-adv
View on GitHub
[CVPR23] "Towards Compositional Adversarial Robustness: Generalizing Adversarial Training to Composite Semantic Perturbations" by Lei Hsi…
☆23Sep 17, 2025Updated 10 months ago
DualityRL / multi-attempt
View on GitHub
☆19Mar 10, 2025Updated last year
watcl-lab / positional_attention
View on GitHub
Source code for the paper "Positional Attention: Expressivity and Learnability of Algorithmic Computation"
☆14May 26, 2025Updated last year