[ICLR 2025] Code&Data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization"
☆15 · Jun 21, 2024 · Updated last year
Alternatives and similar repositories for weak-to-strong-deception
Users that are interested in weak-to-strong-deception are comparing it to the libraries listed below.
- Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts ☆16 · Feb 26, 2024 · Updated 2 years ago
- Code for the paper "Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models" (NAACL-… ☆45 · Jul 26, 2021 · Updated 4 years ago
- Official repository for paper "DeepCritic: Deliberate Critique with Large Language Models" ☆41 · Jun 24, 2025 · Updated 10 months ago
- ☆47 · Jun 24, 2025 · Updated 10 months ago
- How Robust are Randomized Smoothing based Defenses to Data Poisoning? (CVPR 2021) ☆14 · Jul 16, 2021 · Updated 4 years ago
- Applies ROME and MEMIT on Mamba-S4 models ☆15 · Apr 5, 2024 · Updated 2 years ago
- Code for the paper "RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models" (EMNLP 2021) ☆25 · Oct 21, 2021 · Updated 4 years ago
- Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data" ☆48 · Jan 17, 2024 · Updated 2 years ago
- ☆51 · Oct 23, 2023 · Updated 2 years ago
- Benchmark of crystal structure prediction algorithms