keven980716 / weak-to-strong-deception
[ICLR 2025] Code&Data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization"
☆13 Updated 10 months ago
Alternatives and similar repositories for weak-to-strong-deception
Users who are interested in weak-to-strong-deception are comparing it to the repositories listed below
- ☆25 Updated 11 months ago
- [NeurIPS 2024] "Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?" ☆35 Updated 3 months ago
- [AAAI 2024] MELO: Enhancing Model Editing with Neuron-indexed Dynamic LoRA ☆25 Updated last year
- In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024) ☆57 Updated last year
- ☆36 Updated 7 months ago
- Official code for ICML 2024 paper on Persona In-Context Learning (PICLe) ☆24 Updated 10 months ago
- Official code for SEAL: Steerable Reasoning Calibration of Large Language Models for Free ☆22 Updated last month
- ☆35 Updated last year
- Representation Surgery for Multi-Task Model Merging. ICML, 2024. ☆45 Updated 7 months ago
- Codes for Merging Large Language Models ☆29 Updated 9 months ago
- RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024 ☆74 Updated 7 months ago
- EMNLP 2024: Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue ☆35 Updated 5 months ago
- What Makes a Reward Model a Good Teacher? An Optimization Perspective ☆28 Updated last month
- [NeurIPS 2023] Github repository for "Composing Parameter-Efficient Modules with Arithmetic Operations" ☆61 Updated last year
- SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Model https://arxiv.org/pdf/2411.02433 ☆25 Updated 5 months ago
- Code for paper "Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models." ☆42 Updated 6 months ago
- Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient Tuning ☆30 Updated 5 months ago
- ☆18 Updated last month
- [ACL 2024] Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models ☆49 Updated 8 months ago
- Mosaic IT: Enhancing Instruction Tuning with Data Mosaics ☆18 Updated 3 months ago
- A Task of Fictitious Unlearning for VLMs ☆15 Updated last month
- Code for "CREAM: Consistency Regularized Self-Rewarding Language Models", ICLR 2025. ☆21 Updated 2 months ago
- Code for Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities (NeurIPS'24) ☆21 Updated 4 months ago
- SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities ☆14 Updated last month
- Official Code and data for ACL 2024 finding, "An Empirical Study on Parameter-Efficient Fine-Tuning for MultiModal Large Language Models" ☆19 Updated 6 months ago
- [NeurIPS 2024 Spotlight] EMR-Merging: Tuning-Free High-Performance Model Merging ☆58 Updated 2 months ago
- Lightweight Adapting for Black-Box Large Language Models ☆22 Updated last year
- Code for "A Sober Look at Progress in Language Model Reasoning" paper ☆45 Updated last week
- ☆27 Updated last year
- code for Learning the Unlearned: Mitigating Feature Suppression in Contrastive Learning ☆16 Updated 9 months ago