[ICLR 2025] Code&Data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization"
☆14Jun 21, 2024Updated last year
Alternatives and similar repositories for weak-to-strong-deception
Users that are interested in weak-to-strong-deception are comparing it to the libraries listed below.
- ☆16Mar 22, 2025Updated last year
- Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts☆16Feb 26, 2024Updated 2 years ago
- Code for the paper "Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models" (NAACL-…☆44Jul 26, 2021Updated 4 years ago
- Official repository for paper "DeepCritic: Deliberate Critique with Large Language Models"☆41Jun 24, 2025Updated 9 months ago
- ☆46Jun 24, 2025Updated 9 months ago
- How Robust are Randomized Smoothing based Defenses to Data Poisoning? (CVPR 2021)☆14Jul 16, 2021Updated 4 years ago
- Code for the paper "Rethinking Stealthiness of Backdoor Attack against NLP Models" (ACL-IJCNLP 2021)☆24Dec 9, 2021Updated 4 years ago
- Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data"☆48Jan 17, 2024Updated 2 years ago
- ☆52Oct 23, 2023Updated 2 years ago
- Official repository for ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning"☆187May 20, 2025Updated 10 months ago
- A fast procedural scene generation framework☆22Dec 31, 2025Updated 2 months ago
- An official implementation of "Rethinking Graph Backdoor Attacks: A Distribution-Preserving Perspective" (KDD 2024)☆12Sep 16, 2024Updated last year
- The rule-based evaluation subset and code implementation of Omni-MATH☆27Dec 23, 2024Updated last year
- [ICLR 2025 Spotlight] Weak-to-strong preference optimization: stealing reward from weak aligned model☆18Feb 24, 2025Updated last year
- Reinforcing General Reasoning without Verifiers☆97Jun 24, 2025Updated 9 months ago
- Methods and evaluation for aligning language models temporally☆30Mar 2, 2024Updated 2 years ago
- [ICML 2024] "LoCoCo: Dropping In Convolutions for Long Context Compression", Ruisi Cai, Yuandong Tian, Zhangyang Wang, Beidi Chen☆17Sep 7, 2024Updated last year
- Your finetuned model's back to its original safety standards faster than you can say "SafetyLock"!☆11Oct 16, 2024Updated last year
- [TPAMI Major Revision] Resource Summary for paper "Unveiling the Unseen: A Comprehensive Survey on Explainable Anomaly Detection in Image…☆35Apr 27, 2025Updated 11 months ago
- my commonly-used tools☆64Jan 7, 2025Updated last year
- Code for ICLR 2025 Paper "GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment"☆20Feb 10, 2025Updated last year
- From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning.☆25Oct 7, 2025Updated 5 months ago
- A collection of resources for graph-based semi-supervised learning (GSSL).☆20Aug 30, 2021Updated 4 years ago
- [WSDM 2026] LookAhead Tuning: Safer Language Models via Partial Answer Previews☆17Dec 14, 2025Updated 3 months ago
- Measuring the situational awareness of language models☆40Feb 12, 2024Updated 2 years ago
- Official implementation of ICML 2025 paper "Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach"☆11May 27, 2025Updated 10 months ago
- Repo for paper: Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge☆14Feb 20, 2024Updated 2 years ago
- ☆11Jan 19, 2025Updated last year
- A toolkit for testing and improving named entity recognition [ESEC/FSE'23]☆11Aug 31, 2023Updated 2 years ago
- ☆35Jul 2, 2025Updated 8 months ago
- Code for paper 'Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation of the Reversal Curse'☆13Aug 2, 2024Updated last year
- Tasks for describing differences between text distributions.☆17Aug 9, 2024Updated last year
- Implementation of Direct Preference Optimization☆17Jul 17, 2023Updated 2 years ago
- ☆13Sep 12, 2024Updated last year
- Spectral Perturbation Meets Incomplete Multi-view Data, In IJCAI-2019☆20May 18, 2021Updated 4 years ago
- [TKDE 2024, CIKM 2022] SLA²P: Self-supervised Anomaly Detection with Adversarial Perturbation.☆39Dec 26, 2024Updated last year
- Localized Sparse Incomplete Multi-view Clustering☆25May 17, 2023Updated 2 years ago
- An official implementation of "Catastrophic Failure of LLM Unlearning via Quantization" (ICLR 2025)☆37Feb 22, 2025Updated last year
- ☆15Jul 14, 2022Updated 3 years ago