[ICLR 2025] Code&Data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization"
☆15 · Jun 21, 2024 · Updated last year
Alternatives and similar repositories for weak-to-strong-deception
Users who are interested in weak-to-strong-deception are comparing it to the libraries listed below.
- ☆16 · Mar 22, 2025 · Updated last year
- Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts ☆15 · Feb 26, 2024 · Updated 2 years ago
- Code for the paper "Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models" (NAACL-… ☆45 · Jul 26, 2021 · Updated 4 years ago
- Official repository for paper "DeepCritic: Deliberate Critique with Large Language Models" ☆41 · Jun 24, 2025 · Updated 11 months ago
- Unofficial implementation of "Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection" ☆27 · Jul 6, 2024 · Updated last year
- ☆47 · Jun 24, 2025 · Updated 11 months ago
- How Robust are Randomized Smoothing based Defenses to Data Poisoning? (CVPR 2021) ☆14 · Jul 16, 2021 · Updated 4 years ago
- Applies ROME and MEMIT on Mamba-S4 models ☆15 · Apr 5, 2024 · Updated 2 years ago
- Code for the paper "RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models" (EMNLP 2021) ☆25 · Oct 21, 2021 · Updated 4 years ago
- [ICLR 2026] Meta-RL Induces Exploration in Language Agents ☆42 · Feb 1, 2026 · Updated 4 months ago
- ☆56 · Oct 23, 2023 · Updated 2 years ago
- Benchmark of crystal structure prediction algorithms ☆15 · Jun 9, 2025 · Updated last year
- Official repository for ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning" ☆192 · May 20, 2025 · Updated last year
- Official Repository for the ICLR 2022 paper "Generalization of Neural Combinatorial Solvers through the Lens of Adversarial Robustness" ☆13 · Nov 20, 2022 · Updated 3 years ago
- The official code of Multi-player Nash Preference Optimization [ICLR 2026] ☆35 · Feb 4, 2026 · Updated 4 months ago
- The rule-based evaluation subset and code implementation of Omni-MATH ☆27 · Dec 23, 2024 · Updated last year
- An official implementation of "Rethinking Graph Backdoor Attacks: A Distribution-Preserving Perspective" (KDD 2024) ☆12 · Sep 16, 2024 · Updated last year
- ☆13 · Nov 20, 2023 · Updated 2 years ago
- [ICLR 2025 Spotlight] Weak-to-strong preference optimization: stealing reward from weak aligned model ☆18 · Feb 24, 2025 · Updated last year
- Reinforcing General Reasoning without Verifiers ☆101 · Jun 24, 2025 · Updated 11 months ago
- Your finetuned model's back to its original safety standards faster than you can say "SafetyLock"! ☆11 · Oct 16, 2024 · Updated last year
- Methods and evaluation for aligning language models temporally ☆31 · Mar 2, 2024 · Updated 2 years ago
- Implements POP3 and SMTP clients starting from raw sockets, covering basic mail functions (composing, sending, receiving, reading, and deleting), with a simple PyQt5 interface ☆12 · Dec 24, 2023 · Updated 2 years ago
- A python implement for Certifiable Robust Multi-modal Training ☆20 · Jun 21, 2025 · Updated 11 months ago
- my commonly-used tools ☆64 · Jan 7, 2025 · Updated last year
- From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning. ☆25 · Oct 7, 2025 · Updated 8 months ago
- A collection of resources for graph-based semi-supervised learning (GSSL). ☆20 · Aug 30, 2021 · Updated 4 years ago
- Teaching Models to Express Their Uncertainty in Words ☆38 · May 26, 2022 · Updated 4 years ago
- Measuring the situational awareness of language models ☆41 · Feb 12, 2024 · Updated 2 years ago
- Official implementation of ICML 2025 paper "Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach" ☆12 · May 27, 2025 · Updated last year
- Code for CascadeBERT, Findings of EMNLP 2021 ☆12 · Mar 30, 2022 · Updated 4 years ago
- Repo for paper: Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge ☆14 · Feb 20, 2024 · Updated 2 years ago
- Code for ICLR 2025 Paper "GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment" ☆24 · Feb 10, 2025 · Updated last year
- ☆19 · Apr 7, 2025 · Updated last year
- ☆11 · Jan 19, 2025 · Updated last year
- A toolkit for testing and improving named entity recognition [ESEC/FSE'23] ☆11 · Aug 31, 2023 · Updated 2 years ago
- ☆14 · Aug 16, 2022 · Updated 3 years ago
- ☆10 · Mar 13, 2023 · Updated 3 years ago
- Code for paper 'Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation of the Reversal Curse' ☆14 · Aug 2, 2024 · Updated last year