sam-paech / diplobench
Benchmark for LLMs playing full press diplomacy
☆57 Updated 9 months ago
Alternatives and similar repositories for diplobench
Users who are interested in diplobench are comparing it to the libraries listed below
- ☆125 Updated last year
- ☆67 Updated 5 months ago
- ☆183 Updated last year
- look how they massacred my boy ☆63 Updated last year
- Train an adapter for any embedding model in under a minute ☆129 Updated 8 months ago
- Public repository containing METR's DVC pipeline for eval data analysis ☆144 Updated 8 months ago
- An easy-to-understand framework for LLM samplers that rewind and revise generated tokens ☆150 Updated 10 months ago
- Inference-time scaling for LLMs-as-a-judge. ☆316 Updated last month
- A framework for orchestrating AI agents using a mermaid graph ☆77 Updated last year
- Implementation of the board game Codenames, re-imagined as a collaborative game between LLM agents ☆108 Updated 9 months ago
- Train your own SOTA deductive reasoning model ☆107 Updated 9 months ago
- An introduction to LLM Sampling ☆79 Updated last year
- A comprehensive repository of reasoning tasks for LLMs (and beyond) ☆453 Updated last year
- A DSPy-based implementation of the tree of thoughts method (Yao et al., 2023) for generating persuasive arguments ☆94 Updated 2 months ago
- A framework for optimizing DSPy programs with RL ☆298 Updated last month
- ☆131 Updated 11 months ago
- An automated tool for discovering insights from research paper corpora ☆137 Updated last year
- 📝 Reference-Free automatic summarization evaluation with potential hallucination detection ☆103 Updated last year
- PageRank for LLMs ☆51 Updated 3 months ago
- Plotting (entropy, varentropy) for small LMs ☆99 Updated 7 months ago
- autologic is a Python package that implements the SELF-DISCOVER framework proposed in the paper SELF-DISCOVER: Large Language Models Self… ☆60 Updated last year
- ReDel is a toolkit for researchers and developers to build, iterate on, and analyze recursive multi-agent systems. (EMNLP 2024 Demo) ☆89 Updated last week
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ☆91 Updated 10 months ago
- Official repository for "DynaSaur: Large Language Agents Beyond Predefined Actions" ☆351 Updated 11 months ago
- Claude Deep Research config for Claude Code. ☆222 Updated 9 months ago
- Code for "Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs" ☆85 Updated 9 months ago
- A library for benchmarking the Long Term Memory and Continual learning capabilities of LLM based agents. With all the tests and code you… ☆81 Updated last year
- Official code for NeurIPS 2025 paper "AutoDiscovery: Open-ended Scientific Discovery via Bayesian Surprise" ☆113 Updated this week
- ☆44 Updated last year
- Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning ☆235 Updated 9 months ago