Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a move (1, 3, or 5 steps). Whenever two or more players choose the same number, all colliding players fail to advance.
☆85 · Dec 9, 2025 · Updated 2 months ago
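The collision rule in the description can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: the names `ALLOWED_MOVES` and `resolve_round` are hypothetical, and the public conversation phase and any win condition are not modeled because the description does not specify them.

```python
from collections import Counter

# Moves named in the description; the constant name itself is an assumption.
ALLOWED_MOVES = (1, 3, 5)


def resolve_round(positions, moves):
    """Return new positions after one round of secret move selection.

    positions: dict mapping player name -> current step count
    moves:     dict mapping player name -> chosen move (1, 3, or 5)

    A player advances only if no other player picked the same number.
    """
    for player, move in moves.items():
        if move not in ALLOWED_MOVES:
            raise ValueError(f"{player} chose an illegal move: {move}")

    # Count how many players picked each number.
    counts = Counter(moves.values())

    updated = dict(positions)
    for player, move in moves.items():
        # A move shared by two or more players is a collision: nobody on it advances.
        if counts[move] == 1:
            updated[player] = updated.get(player, 0) + move
    return updated


positions = {"A": 0, "B": 0, "C": 0}
moves = {"A": 5, "B": 5, "C": 3}
print(resolve_round(positions, moves))
```

Here A and B both pick 5 and stay where they are, while C alone picks 3 and advances.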
Alternatives and similar repositories for step_game
Users who are interested in step_game are comparing it to the repositories listed below.
- Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claud… ☆31 · Mar 20, 2025 · Updated 11 months ago
- LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words that start with a given letter with no connections to each oth… ☆35 · Mar 20, 2025 · Updated 11 months ago
- Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent benchmark that tests cooperative and self-interested strategies a… ☆39 · Apr 10, 2025 · Updated 10 months ago
- Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a sm… ☆63 · Sep 22, 2025 · Updated 5 months ago
- Documents the style side of the short-story Creative Writing LLM benchmark: we generated many short stories with a range of LLMs, then an… ☆22 · Dec 18, 2025 · Updated 2 months ago
- A multi-player tournament benchmark that tests LLMs in social reasoning, strategy, and deception. Players engage in public and private co… ☆300 · Jan 7, 2026 · Updated last month
- This benchmark tests how well LLMs incorporate a set of 10 mandatory story elements (characters, objects, core concepts, attributes, moti… ☆347 · Feb 6, 2026 · Updated 3 weeks ago
- Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended with extra trick words ☆198 · Feb 23, 2026 · Updated last week
- The BAZAAR challenges LLMs to navigate the double-auction marketplace, where buyers and sellers must make strategic decisions with incomp… ☆35 · Jul 30, 2025 · Updated 7 months ago
- documentation used in my projects ☆16 · Updated this week
- Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers. ☆243 · Aug 7, 2025 · Updated 6 months ago
- qwen3 experiments ☆34 · Jul 1, 2025 · Updated 8 months ago
- A benchmark for conversational bargaining by language models. In each 20‑round match one LLM plays buyer, one plays seller, and both hold… ☆34 · Aug 21, 2025 · Updated 6 months ago
- ☆17 · Aug 5, 2025 · Updated 7 months ago
- FamilyBench evaluation tool for testing the relational reasoning capabilities of Large Language Models (LLMs). ☆41 · Oct 6, 2025 · Updated 4 months ago
- Tutorial for TikZ ☆11 · Apr 3, 2025 · Updated 11 months ago
- 🕷️ n8n Community Node for Scrappey API – Automate web scraping and data extraction with advanced anti-bot blocking technology, seamlessl… ☆16 · Feb 2, 2026 · Updated last month
- ☆12 · Jan 19, 2024 · Updated 2 years ago
- Personal Finance Expense Tracker ☆20 · Nov 14, 2025 · Updated 3 months ago
- A wrapper around libssh2 for .NET ☆29 · Jan 21, 2026 · Updated last month
- ☆16 · Jul 1, 2025 · Updated 8 months ago
- SING: SDE Inference via Natural Gradients ☆36 · Dec 9, 2025 · Updated 2 months ago
- Code for paper "Reasoning Like an Economist: Post-Training on Economic Problems Induces Strategic Generalization in LLMs" ☆12 · Jun 11, 2025 · Updated 8 months ago
- A simple, interactive web tool to compare pricing and performance metrics of various AI models. ☆16 · Feb 22, 2026 · Updated last week
- ☆46 · Jun 20, 2025 · Updated 8 months ago
- Camera app drawn on SkiaSharp canvas with real-time SKSL shaders. Built-in desktop shader editor. Made with DrawnUI for .NET MAUI. ☆22 · Feb 20, 2026 · Updated last week
- Compare Naive Bayes, SVM, XGBoost, Bagging, AdaBoost, K-Nearest Neighbors, Random Forests for classification of Malaria Cells ☆11 · Jun 5, 2019 · Updated 6 years ago
- Groq-powered MAD: The first work to explore Multi-Agent Debate with Large Language Models :D ☆12 · Jul 5, 2024 · Updated last year
- Visual image composition helper node for ComfyUI. Grid, diagonals, Phi Grid, Pyramid, Golden Triangles, Perspective lines. Color settings… ☆16 · Jul 10, 2025 · Updated 7 months ago
- This repo documents my workflows and stack to run comfy ui GenANI assist under windows ☆30 · Feb 14, 2026 · Updated 2 weeks ago
- Integration into BPM tools (Business Process Management software tools for modeling high-level and detailed processes) and EA (from busi… ☆17 · Updated this week
- A powerful, interactive Python CLI for converting, manipulating, and inspecting media files using FFmpeg 🎬 ☆17 · Feb 10, 2026 · Updated 3 weeks ago
- [NeurIPS 2025] Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking ☆22 · Oct 22, 2025 · Updated 4 months ago
- Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge. ☆17 · Feb 9, 2026 · Updated 3 weeks ago
- AI code generation and improvement ☆34 · Aug 29, 2025 · Updated 6 months ago
- [NeurIPS 2025, Spotlight]: Ambient-o: Training Good models with Bad Data. ☆31 · Jan 21, 2026 · Updated last month
- A simple external application for Windows that allows you to scan an existing custom_nodes directory and generate a list of the nodes ins… ☆20 · Jul 6, 2025 · Updated 7 months ago
- Real-time webcam demo with SmolVLM(mlx-community/SmolVLM-Instruct-4bit) and MLX-VLM ☆25 · Jun 12, 2025 · Updated 8 months ago
- Perl 5 Metaconfig ☆16 · Feb 12, 2026 · Updated 3 weeks ago