keven980716 / weak-to-strong-deception

[ICLR 2025] Code&Data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization"

☆13

Alternatives and similar repositories for weak-to-strong-deception

Users who are interested in weak-to-strong-deception are comparing it to the repositories listed below.

EnnengYang / RepresentationSurgery
Representation Surgery for Multi-Task Model Merging. ICML, 2024.
☆47 Updated last year
hkust-nlp / Activation_Decoding
In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024)
☆62 Updated last year
ECNU-ICALK / MELO
[AAAI 2024] MELO: Enhancing Model Editing with Neuron-indexed Dynamic LoRA
☆26 Updated last year
tmlr-group / NoisyRationales
[NeurIPS 2024] "Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?"
☆37 Updated 5 months ago
which47 / LLMCL
Analyzing and Reducing Catastrophic Forgetting in Parameter Efficient Tuning
☆36 Updated last year
ChnQ / TracingLLM
☆30 Updated last year
princeton-pli / what-makes-good-rm
[NeurIPS 2025] What Makes a Reward Model a Good Teacher? An Optimization Perspective
☆41 Updated 3 months ago
harveyhuang18 / EMR_Merging
[NeurIPS 2024 Spotlight] EMR-Merging: Tuning-Free High-Performance Model Merging
☆74 Updated 9 months ago
EnnengYang / AdaMerging
AdaMerging: Adaptive Model Merging for Multi-Task Learning. ICLR, 2024.
☆97 Updated last year
rdi-berkeley / awesome-RLVR-boundary
A curated list of resources on Reinforcement Learning with Verifiable Rewards (RLVR) and the reasoning capability boundary of Large Langu…
☆86 Updated 2 weeks ago
Alsace08 / OOD-Math-Reasoning
[NeurIPS 2024] Code and Data Repo for Paper "Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning"
☆27 Updated last year
bethgelab / sober-reasoning
A Sober Look at Language Model Reasoning
☆92 Updated last month
Zayne-sprague / To-CoT-or-not-to-CoT
☆24 Updated 8 months ago
luka-group / vlm-knowledge-conflict
Code for paper "Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models."
☆50 Updated last year
Model-GLUE / Model-GLUE
☆18 Updated last year
deeplearning-wisc / picle
Official code for ICML 2024 paper on Persona In-Context Learning (PICLe)
☆26 Updated last year
jinhaoduan / SAR
[ACL 2024] Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models
☆60 Updated last year
mathllm / Step-Controlled_DPO
☆23 Updated last year
UKPLab / iclr2024-model-merging
This is the repository for "Model Merging by Uncertainty-Based Gradient Matching", ICLR 2024.
☆29 Updated last year
circle-hit / SAPT
Code for ACL 2024 accepted paper titled "SAPT: A Shared Attention Framework for Parameter-Efficient Continual Learning of Large Language …
☆38 Updated 11 months ago
shizhediao / Black-Box-Prompt-Learning
Source code for the TMLR paper "Black-Box Prompt Learning for Pre-trained Language Models"
☆57 Updated 2 years ago
yule-BUAA / MergeLLM
Codes for Merging Large Language Models
☆34 Updated last year
SophieZheng998 / ALI-Agent
Official implementation for "ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation"
☆21 Updated 4 months ago
hkust-nlp / PEM_composition
[NeurIPS 2023] Github repository for "Composing Parameter-Efficient Modules with Arithmetic Operations"
☆61 Updated 2 years ago
SihengLi99 / LLM-Honesty-Survey
[2025-TMLR] A Survey on the Honesty of Large Language Models
☆64 Updated last year
princeton-nlp / benign-data-breaks-safety
☆43 Updated last year
junkangwu / beta-DPO
[NeurIPS 2024] Official code of $\beta$-DPO: Direct Preference Optimization with Dynamic $\beta$
☆50 Updated last year
yaojin17 / Unlearning_LLM
[ACL 2024] Code and data for "Machine Unlearning of Pre-trained Large Language Models"
☆65 Updated last year
JasonForJoy / Model-Editing-Hurt
EMNLP 2024: Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue
☆38 Updated 7 months ago
lukahhcm / Awesome_Environment_Scaling
Resources and paper list for 'Scaling Environments for Agents'. This repository accompanies our survey on how environments contribute to …
☆48 Updated last week