[ICLR 2025] Code&Data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization"
☆15Jun 21, 2024Updated last year
Alternatives and similar repositories for weak-to-strong-deception
Users who are interested in weak-to-strong-deception are comparing it to the libraries listed below.
- Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts☆16Feb 26, 2024Updated 2 years ago
- Code for the paper "Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models" (NAACL-…☆45Jul 26, 2021Updated 4 years ago
- Official repository for paper "DeepCritic: Deliberate Critique with Large Language Models"☆41Jun 24, 2025Updated 11 months ago
- How Robust are Randomized Smoothing based Defenses to Data Poisoning? (CVPR 2021)☆14Jul 16, 2021Updated 4 years ago
- Applies ROME and MEMIT on Mamba-S4 models☆15Apr 5, 2024Updated 2 years ago
- Code for the paper "RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models" (EMNLP 2021)☆25Oct 21, 2021Updated 4 years ago
- Code for the paper "Rethinking Stealthiness of Backdoor Attack against NLP Models" (ACL-IJCNLP 2021)☆24Dec 9, 2021Updated 4 years ago
- Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data"☆48Jan 17, 2024Updated 2 years ago
- [ICLR 2026] Meta-RL Induces Exploration in Language Agents☆40Feb 1, 2026Updated 3 months ago
- Official repository for ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning"☆190May 20, 2025Updated last year
- Official Repository for the ICLR 2022 paper "Generalization of Neural Combinatorial Solvers through the Lens of Adversarial Robustness"☆13Nov 20, 2022Updated 3 years ago
- The rule-based evaluation subset and code implementation of Omni-MATH☆27Dec 23, 2024Updated last year
- [ICLR 2025 Spotlight] Weak-to-strong preference optimization: stealing reward from weak aligned model☆18Feb 24, 2025Updated last year
- Reinforcing General Reasoning without Verifiers☆100Jun 24, 2025Updated 11 months ago
- Methods and evaluation for aligning language models temporally☆31Mar 2, 2024Updated 2 years ago
- Your finetuned model's back to its original safety standards faster than you can say "SafetyLock"!☆11Oct 16, 2024Updated last year
- [ICML'2024] "LoCoCo: Dropping In Convolutions for Long Context Compression", Ruisi Cai, Yuandong Tian, Zhangyang Wang, Beidi Chen☆17Sep 7, 2024Updated last year
- From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning.☆25Oct 7, 2025Updated 7 months ago
- A collection of resources for graph-based semi-supervised learning (GSSL).☆20Aug 30, 2021Updated 4 years ago
- Teaching Models to Express Their Uncertainty in Words☆38May 26, 2022Updated 4 years ago
- [WSDM 2026] LookAhead Tuning: Safer Language Models via Partial Answer Previews☆17Dec 14, 2025Updated 5 months ago
- Measuring the situational awareness of language models☆41Feb 12, 2024Updated 2 years ago
- Code for CascadeBERT, Findings of EMNLP 2021☆12Mar 30, 2022Updated 4 years ago
- Repo for paper: Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge☆14Feb 20, 2024Updated 2 years ago
- Code for ICLR 2025 Paper "GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment"☆23Feb 10, 2025Updated last year
- Code for paper 'Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation of the Reversal Curse'☆14Aug 2, 2024Updated last year
- [NeurIPS 2021] "Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models" by Boxin Wang*, Chejian Xu*, Shuoh…☆13Apr 3, 2023Updated 3 years ago
- Implementation of Direct Preference Optimization☆17Jul 17, 2023Updated 2 years ago
- Reinforcement Learning for Cut Selection☆12Dec 8, 2022Updated 3 years ago
- Spectral Perturbation Meets Incomplete Multi-view Data, In IJCAI-2019☆20May 18, 2021Updated 5 years ago
- [TKDE 2024, CIKM 2022] SLA²P: Self-supervised Anomaly Detection with Adversarial Perturbation.☆39Dec 26, 2024Updated last year