NUS-TRAIL / Unnatural_LanguageLinks

The official repository of 'Unnatural Language Are Not Bugs but Features for LLMs'

☆24

Alternatives and similar repositories for Unnatural_Language

Users that are interested in Unnatural_Language are comparing it to the libraries listed below

Sorting:

princeton-nlp / benign-data-breaks-safety
☆43Updated last year
VITA-Group / SEAL
[COLM 2025] SEAL: Steerable Reasoning Calibration of Large Language Models for Free
☆50Updated 9 months ago
swj0419 / muse_bench
☆32Updated 10 months ago
sail-sg / ActivePRM
☆20Updated 9 months ago
locuslab / acr-memorization
☆37Updated last year
ethz-spylab / rlhf-poisoning
Code for paper "Universal Jailbreak Backdoors from Poisoned Human Feedback"
☆66Updated last year
licong-lin / negative-preference-optimization
☆71Updated last year
SafeAILab / RAIN
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
☆98Updated last year
sail-sg / I-FSJ
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)
☆65Updated last year
decoding-comp-trust / comp-trust
Codebase for decoding compressed trust.
☆25Updated last year
tml-epfl / long-is-more-for-alignment
Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning [ICML 2024]
☆20Updated last year
milesaturpin / cot-unfaithfulness
☆51Updated 2 years ago
boyiwei / alignment-attribution-code
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
☆89Updated 9 months ago
ybwang119 / Awesome-reasoning-safety
This repo is for the safety topic, including attacks, defenses and studies related to reasoning and RL
☆59Updated 4 months ago
Jayfeather1024 / Backdoor-Enhanced-Alignment
☆24Updated last year
yihuaihong / ConceptVectors
[EMNLP 2025 Main] ConceptVectors Benchmark and Code for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces"
☆40Updated 4 months ago
zhxieml / remiss-jailbreak
☆33Updated last year
uw-nsl / safechain
[ACL 25] SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
☆27Updated 9 months ago
git-disl / Safety-Tax
This is the official code for the paper "Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable".
☆27Updated 10 months ago
avalonstrel / Mitigating-the-Alignment-Tax-of-RLHF
☆16Updated last year
paul-rottger / xstest
Röttger et al. (NAACL 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
☆123Updated 10 months ago
git-disl / Lisa
This is the official code for the paper "Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning" (NeurIPS2024)
☆25Updated last year
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆85Updated 10 months ago
ShuheSH / A-Survey-of-the-Reasoning-Abilities-of-LLMs
☆26Updated 10 months ago
sail-sg / closer-look-LLM-unlearning
[ICLR 2025] A Closer Look at Machine Unlearning for Large Language Models
☆42Updated last year
Model-GLUE / Model-GLUE
☆18Updated last year
SORRY-Bench / sorry-bench
Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)
☆73Updated 10 months ago
pipilurj / MLLM-protector
The official repository for paper "MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance"
☆44Updated last year
y0mingzhang / diffuse-distributions
Forcing Diffuse Distributions out of Language Models
☆18Updated last year
rishub-tamirisa / tamper-resistance
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
☆65Updated 7 months ago