tmlr-group / G-effect
[ICLR 2025] "Rethinking LLM Unlearning Objectives: A Gradient Perspective and Go Beyond"
☆11 Updated last month
Alternatives and similar repositories for G-effect:
Users interested in G-effect are comparing it to the repositories listed below.
- This is the official code for the paper "Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable". ☆12 Updated 3 weeks ago
- [NeurIPS 2024] "Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?" ☆35 Updated 2 months ago
- ☆21 Updated 3 weeks ago
- Translation of the VHL repo in Paddle ☆25 Updated last year
- ☆34 Updated 6 months ago
- This is the official code for the paper "Vaccine: Perturbation-aware Alignment for Large Language Models" (NeurIPS 2024) ☆41 Updated 4 months ago
- [ACL 2024] CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion ☆36 Updated 5 months ago
- ☆24 Updated last year
- SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities ☆12 Updated last week
- ☆16 Updated 7 months ago
- ☆17 Updated 2 weeks ago
- Identification of the Adversary from a Single Adversarial Example (ICML 2023) ☆9 Updated 8 months ago
- ☆13 Updated 7 months ago
- "In-Context Unlearning: Language Models as Few Shot Unlearners". Martin Pawelczyk, Seth Neel*, and Himabindu Lakkaraju*; ICML 2024. ☆24 Updated last year
- Official repository for the paper "Safety Alignment Should Be Made More Than Just a Few Tokens Deep" ☆83 Updated 8 months ago
- ☆53 Updated 8 months ago
- Official implementation for "ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation" ☆16 Updated last month
- ☆20 Updated 3 months ago
- ☆11 Updated last year
- Code for Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities (NeurIPS'24) ☆17 Updated 3 months ago
- [NeurIPS 2023] "Selectivity Drives Productivity: Efficient Dataset Pruning for Enhanced Transfer Learning" by Yihua Zhang*, Yimeng Zhang*, … ☆11 Updated last year
- [ICLR 2025] Code & Data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization" ☆13 Updated 9 months ago
- [CCS-LAMPS'24] LLM IP Protection Against Model Merging ☆14 Updated 5 months ago
- An implementation of SEAL: Safety-Enhanced Aligned LLM fine-tuning via bilevel data selection. ☆12 Updated last month
- The official code of the paper "A Closer Look at Machine Unlearning for Large Language Models". ☆24 Updated 3 months ago
- This is the official code for the paper "Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning" (NeurIPS 2024) ☆19 Updated 6 months ago
- An implementation for MLLM oversensitivity evaluation ☆13 Updated 4 months ago
- This repo covers the safety topic, including attacks, defenses, and studies related to reasoning and RL ☆14 Updated this week
- GitHub repo for the NeurIPS 2024 paper "Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models" ☆12 Updated 5 months ago
- This is the project for IRM (invariant risk minimization) methods ☆12 Updated 3 years ago