uw-nsl / safechain

[ACL 25] SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities

☆20

Alternatives and similar repositories for safechain

Users who are interested in safechain are comparing it to the repositories listed below.


princeton-nlp / benign-data-breaks-safety
☆41 Updated last year
boyiwei / alignment-attribution-code
[ICML 2024] Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
☆85 Updated 6 months ago
VITA-Group / SEAL
Official code for SEAL: Steerable Reasoning Calibration of Large Language Models for Free
☆44 Updated 6 months ago
tmlr-group / AR-Bench
[ICML 2025] "From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?"
☆45 Updated 2 weeks ago
Unispac / shallow-vs-deep-alignment
Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep
☆157 Updated 6 months ago
git-disl / Safety-Tax
This is the official code for the paper "Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable".
☆26 Updated 7 months ago
tmlr-group / NoisyRationales
[NeurIPS 2024] "Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?"
☆37 Updated 3 months ago
swj0419 / muse_bench
☆28 Updated 7 months ago
SafeAILab / RAIN
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
☆99 Updated last year
licong-lin / negative-preference-optimization
☆66 Updated last year
git-disl / Vaccine
This is the official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS2024)
☆47 Updated 11 months ago
ybwang119 / Awesome-reasoning-safety
This repo is for the safety topic, including attacks, defenses and studies related to reasoning and RL
☆46 Updated last month
David-Li0406 / AI-Supervision-Risk
☆21 Updated 7 months ago
yaojin17 / Unlearning_LLM
[ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models"
☆60 Updated last year
sail-sg / I-FSJ
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)
☆65 Updated 9 months ago
ajyl / dpo_toxic
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.
☆82 Updated 7 months ago
javiferran / sae_entities
☆63 Updated 7 months ago
ChnQ / TracingLLM
☆30 Updated last year
jaechan-repo / muse_bench
☆28 Updated last year
jinhaoduan / SAR
[ACL 2024] Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models
☆57 Updated last year
hkust-nlp / Activation_Decoding
In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024)
☆61 Updated last year
OPTML-Group / Unlearn-Simple
[NeurIPS25] Official repo for "Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning"
☆33 Updated 3 weeks ago
thu-coai / SafeUnlearning
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
☆32 Updated last year
thu-ml / STAIR
Official codebase for "STAIR: Improving Safety Alignment with Introspective Reasoning"
☆75 Updated 8 months ago
OPTML-Group / SOUL
Official repo for EMNLP'24 paper "SOUL: Unlocking the Power of Second-Order Optimization for LLM Unlearning"
☆27 Updated last year
zhxieml / remiss-jailbreak
☆32 Updated last year
bethgelab / sober-reasoning
A Sober Look at Language Model Reasoning
☆85 Updated 2 weeks ago
jinzhuoran / RWKU
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024
☆83 Updated last year
yihuaihong / ConceptVectors
[EMNLP 2025 Main] ConceptVectors Benchmark and Code for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces"
☆35 Updated 2 months ago
deeplearning-wisc / picle
Official code for ICML 2024 paper on Persona In-Context Learning (PICLe)
☆26 Updated last year