lechmazur/elimination_game

A multi-player tournament benchmark that tests LLMs on social reasoning, strategy, and deception. Players engage in public and private conversations, form alliances, and vote to eliminate each other.

☆302

Alternatives and similar repositories for elimination_game

Users interested in elimination_game are comparing it to the repositories listed below.

lechmazur / step_game
Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLM…
☆88 · Dec 9, 2025 · Updated 7 months ago
lechmazur / nyt-connections
Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended with extra trick words
☆228 · Jul 1, 2026 · Updated last week
lechmazur / divergent
LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words that start with a given letter with no connections to each oth…
☆35 · Mar 20, 2025 · Updated last year
lechmazur / deception
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claud…
☆33 · Mar 20, 2025 · Updated last year
lechmazur / pgg_bench
Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent benchmark that tests cooperative and self-interested strategies a…
☆41 · Apr 10, 2025 · Updated last year
lechmazur / writing
This benchmark tests how well LLMs incorporate a set of 10 mandatory story elements (characters, objects, core concepts, attributes, moti…
☆398 · Jun 10, 2026 · Updated 3 weeks ago
lechmazur / bazaar
The BAZAAR challenges LLMs to navigate the double-auction marketplace, where buyers and sellers must make strategic decisions with incomp…
☆37 · Jul 30, 2025 · Updated 11 months ago
lechmazur / generalization
Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a sm…
☆71 · Apr 16, 2026 · Updated 2 months ago
lechmazur / confabulations
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
☆248 · Aug 7, 2025 · Updated 11 months ago
jd-3d / SOLOBench
☆136 · May 2, 2025 · Updated last year
lechmazur / pact
A benchmark for conversational bargaining by language models. In each 20‑round match one LLM plays buyer, one plays seller, and both hold…
☆44 · Jun 23, 2026 · Updated 2 weeks ago
0n4li / collab-ai
Collaborative AI Model
☆11 · Nov 27, 2024 · Updated last year
CelVoxes / ceLLama
Cell type annotation with local Large Language Models (LLMs) - Ensuring privacy and speed with extensive customized reports
☆152 · Oct 25, 2024 · Updated last year
tomekkorbak / bliss-attractors
A toy Inspect implementation of the Bliss Attractor eval from Claude 4 System Card Welfare Assessment
☆40 · Jun 5, 2025 · Updated last year
adenta / fire_red_agent
☆165 · Mar 24, 2025 · Updated last year
rawwerks / MineTuning
Mine-tuning is a methodology for synchronizing human and AI attention.
☆21 · Jun 16, 2024 · Updated 2 years ago
Foreseerr / TScale
☆198 · May 5, 2025 · Updated last year
gnh1201 / notebooklm-rest-api
A REST API wrapper for Google NotebookLM powered by notebooklm-py
☆75 · Mar 2, 2026 · Updated 4 months ago
lechmazur / writing_styles
Documents the style side of the short-story Creative Writing LLM benchmark: we generated many short stories with a range of LLMs, then an…
☆25 · Dec 18, 2025 · Updated 6 months ago
formulake / comfyuinode-scan-clone
A simple external application for Windows that allows you to scan an existing custom_nodes directory and generate a list of the nodes ins…
☆21 · Jul 6, 2025 · Updated last year
trymeka / agent
State-of-the-art browsing agent (WebArena 72.7%)
☆365 · Oct 2, 2025 · Updated 9 months ago
rajansagarwal / compression
LLMs can learn their own context compression via RL
☆43 · Nov 26, 2025 · Updated 7 months ago
goodfire-ai / sdxl-turbo-interpretability
☆49 · May 27, 2025 · Updated last year
rcarmo / homekit-steam-user-switcher
A way to remotely switch Steam users using HomeKit
☆41 · Jun 28, 2026 · Updated last week
vimode / Advent-Calendars-For-Developers
Advent Calendars for Web Developers
☆21 · Dec 14, 2025 · Updated 6 months ago
chapmanjacobd / hn_mining
Hacker News data
☆33 · Dec 14, 2025 · Updated 6 months ago
0xSero / mem-layer
Organise your AI's memories with graph database entries
☆78 · Dec 2, 2025 · Updated 7 months ago
EQ-bench / EQ-Bench
A benchmark for emotional intelligence in large language models
☆438 · Jul 26, 2024 · Updated last year
aidanmclaughlin / AidanBench
Aidan Bench attempts to measure <big_model_smell> in LLMs.
☆319 · Jun 26, 2025 · Updated last year
sam-paech / spiral-bench
☆50 · Dec 2, 2025 · Updated 7 months ago
pepicrft / openclaw-plugin-vault
☆40 · Feb 5, 2026 · Updated 5 months ago
akashjss / orpheus-tts-local-webui
Run Orpheus 3B Locally with Gradio UI, Standalone App
☆24 · Apr 1, 2025 · Updated last year
exa-labs / benchmarks
Open benchmarks for evaluating search APIs
☆118 · Jun 1, 2026 · Updated last month
qntm / hyperoperate
Hyperoperations for JavaScript!
☆12 · Jan 1, 2026 · Updated 6 months ago
jmurth1234 / ClaudePlayer
An AI-powered game playing agent using Claude and PyBoy
☆42 · Mar 9, 2025 · Updated last year
joshrutkowski / applescript-mcp
A macOS AppleScript MCP server
☆390 · Apr 19, 2025 · Updated last year
agentkube / agentkube
Agentkube - Run Kubernetes Like Never Before
☆39 · Mar 1, 2026 · Updated 4 months ago
pkage / caffeine
A fairly lightweight daemon that keeps your computer awake. Designed for rootless environments.
☆26 · May 3, 2019 · Updated 7 years ago
FransFaase / MES-replacement
Investigation into replacing the MES compiler
☆35 · Jun 15, 2026 · Updated 3 weeks ago