mukobi / welfare-diplomacy
General-Sum variant of the game Diplomacy for evaluating AIs.
☆23 Updated 7 months ago
Related projects
Alternatives and complementary repositories for welfare-diplomacy
- ☆122 Updated 3 weeks ago
- Interpreting how transformers simulate agents performing RL tasks ☆73 Updated last year
- datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆62 Updated last year
- Sparse Autoencoder Training Library ☆27 Updated 3 weeks ago
- Rewarded soups official implementation ☆51 Updated last year
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" ☆84 Updated 8 months ago
- Code and data for the paper "Understanding Hidden Context in Preference Learning: Consequences for RLHF" ☆27 Updated 11 months ago
- Algebraic value editing in pretrained language models ☆57 Updated last year
- LLM experiments done during SERI MATS - focusing on activation steering / interpreting activation spaces ☆78 Updated last year
- ☆73 Updated 4 months ago
- ☆32 Updated last year
- Redwood Research's transformer interpretability tools ☆12 Updated 2 years ago
- Research Code for "ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL" ☆108 Updated 7 months ago
- Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner ☆15 Updated 4 months ago
- Steering Llama 2 with Contrastive Activation Addition ☆98 Updated 5 months ago
- AdaPlanner: Language Models for Decision Making via Adaptive Planning from Feedback ☆96 Updated last year
- Repo for: When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment ☆38 Updated last year
- Mechanistic Interpretability for Transformer Models ☆49 Updated 2 years ago
- [ACL 2024] Exploring Collaboration Mechanisms for LLM Agents: A Social Psychology View ☆102 Updated 6 months ago
- ☆188 Updated last month
- Hypothetical Minds is an autonomous LLM-based agent for diverse multi-agent settings, integrating a Theory of Mind module Theory of Mind … ☆18 Updated 4 months ago
- Benchmarking LLMs' Gaming Ability in Multi-Agent Environments ☆39 Updated last month
- Interpretable text embeddings by asking LLMs yes/no questions (NeurIPS 2024) ☆22 Updated last week
- Measuring the situational awareness of language models ☆33 Updated 9 months ago
- The Prism Alignment Project ☆37 Updated 6 months ago
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision ☆97 Updated 2 months ago
- ☆98 Updated 3 months ago
- ☆33 Updated 9 months ago
- Inspecting and Editing Knowledge Representations in Language Models ☆108 Updated last year
- ☆54 Updated 2 years ago