lechmazur/nyt-connections

Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended with extra trick words

☆230

Alternatives and similar repositories for nyt-connections

Users who are interested in nyt-connections are comparing it to the repositories listed below.

lechmazur / debate
Adversarial multi-turn benchmark for LLM debate quality, using side-swapped matchups and multi-model judging to rank models by judged deb…
☆28 · Jul 17, 2026 · Updated last week
lechmazur / buyout_game
A multi-agent benchmark where eight LLMs play a money-driven elimination game with private transfers and a buyout endgame, and are ranked…
☆18 · May 27, 2026 · Updated last month
lechmazur / generalization
Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a sm…
☆72 · Apr 16, 2026 · Updated 3 months ago
lechmazur / pgg_bench
Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent benchmark that tests cooperative and self-interested strategies a…
☆41 · Apr 10, 2025 · Updated last year
lechmazur / step_game
Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLM…
☆89 · Dec 9, 2025 · Updated 7 months ago
lechmazur / pact
A benchmark for conversational bargaining by language models. In each 20‑round match one LLM plays buyer, one plays seller, and both hold…
☆44 · Jun 23, 2026 · Updated last month
lechmazur / divergent
LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words that start with a given letter with no connections to each oth…
☆35 · Mar 20, 2025 · Updated last year
lechmazur / elimination_game
A multi-player tournament benchmark that tests LLMs in social reasoning, strategy, and deception. Players engage in public and private co…
☆301 · Jan 7, 2026 · Updated 6 months ago
lechmazur / confabulations
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
☆247 · Aug 7, 2025 · Updated 11 months ago
lechmazur / deception
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claud…
☆33 · Mar 20, 2025 · Updated last year
lechmazur / writing
This benchmark tests how well LLMs incorporate a set of 10 mandatory story elements (characters, objects, core concepts, attributes, moti…
☆408 · Updated this week
lechmazur / persuasion
LLM Persuasion Benchmark tests whether one language model can change another model’s stated position over the course of a multi-turn conv…
☆31 · Mar 27, 2026 · Updated 3 months ago
lechmazur / bazaar
The BAZAAR challenges LLMs to navigate the double-auction marketplace, where buyers and sellers must make strategic decisions with incomp…
☆37 · Jul 30, 2025 · Updated 11 months ago
lechmazur / emergent_collusion
Systemic, uninstructed collusion among frontier LLMs in a simulated bidding environment
☆18 · Jul 15, 2025 · Updated last year
Orolol / familyBench
FamilyBench evaluation tool for testing the relational reasoning capabilities of Large Language Models (LLMs).
☆47 · May 4, 2026 · Updated 2 months ago
thad0ctor / KrunchWrapper
☆18 · Jul 1, 2025 · Updated last year
hyperfocAIs / Attend
Attend - to what matters.
☆17 · Feb 22, 2025 · Updated last year
fairydreaming / lineage-bench
Testing LLM reasoning abilities with lineage relationship quizzes.
☆44 · Mar 10, 2026 · Updated 4 months ago
NimbleEdge / sparse_transformers
Sparse Inferencing for transformer based LLMs
☆219 · Mar 25, 2026 · Updated 4 months ago
AaronFeng753 / Better-Qwen3
Auto Thinking Mode switch for Qwen3 in Open webui
☆72 · May 8, 2025 · Updated last year
bjodah / llm-multi-backend-container
Docker/podman container for llama.cpp/vllm/exllamav{2,3} orchestrated using llama-swap
☆18 · Jun 17, 2026 · Updated last month
johnbean393 / SVGBench
SVGBench: A challenging LLM benchmark that tests knowledge, coding, physical reasoning capabilities of LLMs.
☆72 · Feb 12, 2026 · Updated 5 months ago
Pranit-Harekar / better-naming
A genie in a bottle, ready to grant developers' wishes for well-named variables
☆11 · Feb 23, 2023 · Updated 3 years ago
cpldcpu / llmbenchmark
Various LLM Benchmarks
☆26 · Feb 20, 2026 · Updated 5 months ago
fajrmn / kokoro-on-browser
☆16 · Feb 1, 2025 · Updated last year
EQ-bench / EQ-Bench
A benchmark for emotional intelligence in large language models
☆444 · Jul 26, 2024 · Updated last year
ShmuelRonen / ComfyUI-Gemini_TTS
A powerful ComfyUI custom node that brings Google's Gemini TTS capabilities directly to your workflow. Generate high-quality speech with …
☆22 · May 23, 2025 · Updated last year
CritPt-Benchmark / CritPt
☆84 · Nov 21, 2025 · Updated 8 months ago
fishiatee / Tumera
Yet another frontend for LLM, written using .NET and WinUI 3
☆11 · Sep 14, 2025 · Updated 10 months ago
lliu606 / COSMOS
☆20 · Feb 2, 2026 · Updated 5 months ago
QwenLM / ParScale
Parallel Scaling Law for Language Model — Beyond Parameter and Inference Time Scaling
☆480 · May 17, 2025 · Updated last year
huggingface / wikirace-llms
☆27 · May 7, 2025 · Updated last year
sam-paech / slop-forensics
☆357 · Nov 1, 2025 · Updated 8 months ago
callbacked / vela
An LLM Client for the PS Vita
☆13 · Jun 23, 2025 · Updated last year
lechmazur / sycophancy
LLM benchmark and leaderboard for narrator-bias sycophancy, opposite-narrator contradictions, and judgment consistency.
☆53 · Jun 11, 2026 · Updated last month
lechmazur / position_bias
A benchmark for testing whether LLM judges keep the same preference when two lightly edited versions of the same story are shown in oppos…
☆15 · Jun 11, 2026 · Updated last month
egozverev / Should-It-Be-Executed-Or-Processed
Accompanying code and SEP dataset for the "Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?" paper.
☆61 · Apr 20, 2026 · Updated 3 months ago
DunZhang / Jasper-Token-Compression-Training
The training codes of Jasper-Token-Compression-600M
☆20 · Nov 19, 2025 · Updated 8 months ago
mrconter1 / BenchmarkAggregator
Comprehensive LLM evaluation framework: GPQA Diamond to Chatbot Arena. Tests all major models equally, easily extensible.
☆17 · Aug 22, 2024 · Updated last year