lechmazur/generalization

Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which item truly fits that theme among a collection of misleading candidates.

☆72

Alternatives and similar repositories for generalization

Users who are interested in generalization compare it to the repositories listed below.

lechmazur / debate
Adversarial multi-turn benchmark for LLM debate quality, using side-swapped matchups and multi-model judging to rank models by judged deb…
☆28 · Jul 17, 2026 · Updated last week
lechmazur / buyout_game
A multi-agent benchmark where eight LLMs play a money-driven elimination game with private transfers and a buyout endgame, and are ranked…
☆18 · May 27, 2026 · Updated last month
lechmazur / divergent
LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words that start with a given letter with no connections to each oth…
☆35 · Mar 20, 2025 · Updated last year
lechmazur / position_bias
A benchmark for testing whether LLM judges keep the same preference when two lightly edited versions of the same story are shown in oppos…
☆15 · Jun 11, 2026 · Updated last month
lechmazur / persuasion
LLM Persuasion Benchmark tests whether one language model can change another model’s stated position over the course of a multi-turn conv…
☆31 · Mar 27, 2026 · Updated 3 months ago
lechmazur / confabulations
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
☆247 · Aug 7, 2025 · Updated 11 months ago
lechmazur / writing_styles
Documents the style side of the short-story Creative Writing LLM benchmark: we generated many short stories with a range of LLMs, then an…
☆25 · Dec 18, 2025 · Updated 7 months ago
lechmazur / step_game
Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLM…
☆89 · Dec 9, 2025 · Updated 7 months ago
lechmazur / deception
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claud…
☆33 · Mar 20, 2025 · Updated last year
lechmazur / pgg_bench
Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent benchmark that tests cooperative and self-interested strategies a…
☆41 · Apr 10, 2025 · Updated last year
lechmazur / sycophancy
LLM benchmark and leaderboard for narrator-bias sycophancy, opposite-narrator contradictions, and judgment consistency.
☆53 · Jun 11, 2026 · Updated last month
lechmazur / writing
This benchmark tests how well LLMs incorporate a set of 10 mandatory story elements (characters, objects, core concepts, attributes, moti…
☆408 · Updated this week
lechmazur / pact
A benchmark for conversational bargaining by language models. In each 20‑round match one LLM plays buyer, one plays seller, and both hold…
☆44 · Jun 23, 2026 · Updated last month
lechmazur / nyt-connections
Benchmark that evaluates LLMs using 759 NYT Connections puzzles extended with extra trick words
☆230 · Jul 17, 2026 · Updated last week
lechmazur / elimination_game
A multi-player tournament benchmark that tests LLMs in social reasoning, strategy, and deception. Players engage in public and private co…
☆301 · Jan 7, 2026 · Updated 6 months ago
Orolol / familyBench
FamilyBench evaluation tool for testing the relational reasoning capabilities of Large Language Models (LLMs).
☆47 · May 4, 2026 · Updated 2 months ago
alientony / Split-brain
This is a training method to produce a split brain model
☆14 · Mar 7, 2025 · Updated last year
johnbean393 / SVGBench
SVGBench: A challenging LLM benchmark that tests knowledge, coding, physical reasoning capabilities of LLMs.
☆72 · Feb 12, 2026 · Updated 5 months ago
cp3249 / splaa
SPLAA is an AI assistant framework that utilizes voice recognition, text-to-speech, and tool-calling capabilities to provide a conversati…
☆29 · May 6, 2025 · Updated last year
jd-3d / SOLOBench
☆136 · May 2, 2025 · Updated last year
Tencent-Hunyuan / Hunyuan-4B
☆16 · Aug 5, 2025 · Updated 11 months ago
seruva19 / mecchi
Node-based web UI for AI-powered music generation.
☆13 · Aug 22, 2025 · Updated 11 months ago
kobihackenburg / GPT-4-political-microtargeting
Project repository for "Evaluating the persuasive influence of political microtargeting with large language models" by Kobi Hackenburg an…
☆11 · Jun 19, 2024 · Updated 2 years ago
cpldcpu / MisguidedAttention
A collection of prompts to challenge the reasoning abilities of large language models in presence of misguiding information
☆482 · Jul 31, 2025 · Updated 11 months ago
keeeeenw / TinyLlama
The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
☆14 · Mar 30, 2024 · Updated 2 years ago
lszxb / bf16_huffman_infer
Fused BF16 Huffman GEMV Inference kernel
☆22 · Apr 22, 2026 · Updated 3 months ago
kuzudb / dspy-kuzu-demo
Intro to using DSPy with Kuzu to enrich the data within the Nobel Laureate mentorship network
☆16 · Sep 16, 2025 · Updated 10 months ago
LeonEricsson / llmjudge
Exploring limitations of LLM-as-a-judge
☆20 · Aug 17, 2024 · Updated last year
kaistAI / factual-knowledge-acquisition
☆25 · Dec 12, 2025 · Updated 7 months ago
lukepur / vue-port-graph
☆13 · Jun 21, 2017 · Updated 9 years ago
osome-iu / AI_fact_checking
We conduct a preregistered experiment to investigate whether fact checks provided by a large language model can serve as an effective mis…
☆13 · Dec 14, 2024 · Updated last year
Foaster-ai / Werewolf-bench
☆33 · Aug 30, 2025 · Updated 10 months ago
gptme / gptme-rag
Local RAG as a simple CLI, for standalone use or as a gptme tool
☆48 · Jul 5, 2026 · Updated 2 weeks ago
sparkle-reasoning / sparkle
[NeurIPS'25] Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning
☆16 · Dec 12, 2025 · Updated 7 months ago
kmad / dabench-rlm-eval
Benchmark harness for evaluating DSPy RLMs on data analysis tasks (InfiAgent-DABench)
☆23 · Mar 22, 2026 · Updated 4 months ago
ahxt / mini-r1-zero
☆20 · Feb 2, 2025 · Updated last year
inclusionAI / GroveMoE
☆24 · Aug 20, 2025 · Updated 11 months ago
Toy-97 / Chat-WebUI
Chat WebUI is an easy-to-use user interface for interacting with AI, and it comes with multiple useful built-in tools such as web search …
☆52 · Feb 10, 2026 · Updated 5 months ago
xuyiqing / hbal
Hierarchically Regularized Entropy Balancing
☆12 · Sep 20, 2025 · Updated 10 months ago