Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers"
☆127 · Mar 22, 2024 · Updated last year
Alternatives and similar repositories for llm_debate
Users interested in llm_debate are comparing it to the libraries listed below.
- ☆13 · Jun 4, 2024 · Updated last year
- Official codebase for "Analyzing the Generalization and Reliability of Steering Vectors" ☆19 · Dec 14, 2024 · Updated last year
- ☆17 · Dec 21, 2023 · Updated 2 years ago
- TACL 2025: Investigating Adversarial Trigger Transfer in Large Language Models ☆19 · Aug 17, 2025 · Updated 6 months ago
- Code for the paper "Policy Optimization in RLHF: The Impact of Out-of-preference Data" ☆28 · Dec 19, 2023 · Updated 2 years ago
- Self-Supervised Alignment with Mutual Information ☆20 · May 24, 2024 · Updated last year
- Official implementation of "Can Learned Optimization Make Reinforcement Learning Less Difficult" ☆30 · Dec 15, 2025 · Updated 2 months ago
- ☆52 · Oct 23, 2023 · Updated 2 years ago
- Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments (Zhou et al., EMNLP 2024) ☆14 · Oct 3, 2024 · Updated last year
- [ICLR 2025] Code and data for the paper "Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization" ☆14 · Jun 21, 2024 · Updated last year
- Code for Adaptive Data Optimization ☆32 · Dec 9, 2024 · Updated last year
- Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data" ☆48 · Jan 17, 2024 · Updated 2 years ago
- Official code for the paper "Language Models Learn to Mislead Humans via RLHF" ☆19 · Oct 11, 2024 · Updated last year
- Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers ☆28 · Mar 1, 2025 · Updated last year
- Random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training" ☆131 · Mar 9, 2024 · Updated last year
- ☆177 · Jul 24, 2024 · Updated last year
- [ACL 2024] Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization ☆29 · Jul 9, 2024 · Updated last year
- ☆18 · Feb 25, 2026 · Updated last week
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision ☆124 · Sep 9, 2024 · Updated last year
- Code for the paper "Preserving Diversity in Supervised Fine-tuning of Large Language Models" ☆52 · May 12, 2025 · Updated 9 months ago
- Learning from preferences is a common paradigm for fine-tuning language models. Yet, many algorithmic design decisions come into play. Ou… ☆32 · Apr 20, 2024 · Updated last year
- ☆23 · Jan 17, 2025 · Updated last year
- ☆284 · Mar 2, 2024 · Updated 2 years ago
- ☆23 · Oct 4, 2024 · Updated last year
- [ICLR 2025] Official repository for "Tamper-Resistant Safeguards for Open-Weight LLMs" ☆67 · Jun 9, 2025 · Updated 8 months ago
- [NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct ☆191 · Jan 16, 2025 · Updated last year
- Official implementation for "Law of the Weakest Link: Cross Capabilities of Large Language Models" ☆43 · Oct 1, 2024 · Updated last year
- IIG-RL-Benchmark, a library for training and evaluating game-theoretic and deep RL algorithms on OpenSpiel games ☆25 · Nov 18, 2025 · Updated 3 months ago
- ☆22 · Feb 26, 2024 · Updated 2 years ago
- Scalable Opponent Shaping Experiments in JAX ☆25 · Apr 13, 2024 · Updated last year
- ☆21 · Jul 26, 2025 · Updated 7 months ago
- ☆125 · Nov 13, 2023 · Updated 2 years ago
- Open-source replication of Anthropic's Crosscoders for model diffing ☆64 · Oct 27, 2024 · Updated last year
- Official implementation for the ACL 2024 paper "Causal Estimation of Memorisation Profiles" ☆24 · Mar 25, 2025 · Updated 11 months ago
- ☆10 · Nov 17, 2022 · Updated 3 years ago
- Code for experiments on self-prediction as a way to measure introspection in LLMs ☆16 · Dec 10, 2024 · Updated last year
- Package to optimize adversarial attacks against (large) language models with varied objectives ☆70 · Feb 22, 2024 · Updated 2 years ago
- ☆30 · Jun 19, 2023 · Updated 2 years ago
- ICLR 2024 paper showing properties of safety tuning and exaggerated safety ☆93 · May 9, 2024 · Updated last year