thunlp/ChatEval

Codes for our paper "ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate"

☆340

Alternatives and similar repositories for ChatEval

Users who are interested in ChatEval are comparing it to the libraries listed below.

composable-models / llm_multiagent_debate
ICML 2024: Improving Factuality and Reasoning in Language Models through Multiagent Debate
☆544 · Apr 24, 2025 · Updated last year
Skytliang / Multi-Agents-Debate
MAD: The first work to explore Multi-Agent Debate with Large Language Models :D
☆599 · Dec 16, 2025 · Updated 7 months ago
OpenBMB / AgentVerse
🤖 AgentVerse 🪐 is designed to facilitate the deployment of multiple LLM-based agents in various applications, which primarily provides …
☆5,091 · Sep 9, 2024 · Updated last year
SU-JIAYUAN / M-MAD
[ACL'25] Repo for paper "M-MAD: Multidimensional Multi-Agent Debate for Advanced Machine Translation Evaluation"
☆25 · Feb 19, 2025 · Updated last year
gauss5930 / LLM-Agora
LLM Agora, debating between open-source LLMs to refine the answers
☆88 · Sep 28, 2023 · Updated 2 years ago
TIGER-AI-Lab / TIGERScore
"TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks" [TMLR 2024]
☆32 · Dec 21, 2024 · Updated last year
thunlp / Optima
Code for paper "Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System"
☆72 · Nov 14, 2024 · Updated last year
i-Eval / FairEval
☆145 · Sep 10, 2023 · Updated 2 years ago
LuJunru / MemoChat
MemoChat: Tuning LLMs to Use Memos for Consistent Long-Range Open-Domain Conversation
☆29 · Apr 18, 2024 · Updated 2 years ago
hkust-nlp / AgentBoard
An Analytical Evaluation Board of Multi-turn LLM Agents [NeurIPS 2024 Oral]
☆430 · May 20, 2024 · Updated 2 years ago
Paitesanshi / LLM-Agent-Survey
☆2,909 · Feb 20, 2025 · Updated last year
zjunlp / LLMAgentPapers
Must-read Papers on LLM Agents.
☆3,093 · Updated this week
Link-AGI / AutoAgents
[IJCAI 2024] Generate different roles for GPTs to form a collaborative entity for complex tasks.
☆1,490 · Sep 9, 2025 · Updated 10 months ago
FlagOpen / Infinity-Instruct
☆51 · Jun 14, 2024 · Updated 2 years ago
tmlr-group / ECON
[ICML 2025] "From Debate to Equilibrium: Belief-Driven Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium"
☆39 · Nov 23, 2025 · Updated 8 months ago
Farama-Foundation / ChatArena
ChatArena (or Chat Arena) is a Multi-Agent Language Game Environments for LLMs. The goal is to develop communication and collaboration ca…
☆1,551 · Aug 11, 2025 · Updated 11 months ago
yilunzhao / RobuT
Data and code for ACL 2023 paper "RobuT: A Systematic Study of Table QA Robustness Against Human-Annotated Adversarial Perturbations"
☆15 · Feb 8, 2024 · Updated 2 years ago
jinlanfu / GPTScore
Source Code of Paper "GPTScore: Evaluate as You Desire"
☆258 · Feb 21, 2023 · Updated 3 years ago
SALT-NLP / DyLAN
Official Implementation of Dynamic LLM-Agent Network: An LLM-agent Collaboration Framework with Agent Team Optimization
☆210 · May 16, 2024 · Updated 2 years ago
snap-stanford / MLAgentBench
☆346 · Jun 19, 2024 · Updated 2 years ago
CUHK-ARISE / PsychoBench
Code and data for the paper: On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs
☆135 · Jan 24, 2026 · Updated 6 months ago
THUDM / AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
☆3,620 · Feb 8, 2026 · Updated 5 months ago
ypw0102 / BatchEval
code for ACL2024-main: BatchEval: Towards Human-like Text Evaluation
☆19 · May 20, 2024 · Updated 2 years ago
iiis-ai / IterativeQuestionComposing
[AAAI 2025] Augmenting Math Word Problems via Iterative Question Composing (https://arxiv.org/abs/2401.09003)
☆23 · Oct 2, 2025 · Updated 9 months ago
tatsu-lab / alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
☆2,008 · Aug 9, 2025 · Updated 11 months ago
madaan / self-refine
LLMs can generate feedback on their work, use it to improve the output, and repeat this process iteratively.
☆815 · Oct 4, 2024 · Updated last year
MikeWangWZHL / Solo-Performance-Prompting
Repo for paper "Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration"
☆353 · May 8, 2024 · Updated 2 years ago
IBM / SALMON
Self-Alignment with Principle-Following Reward Models
☆170 · Sep 18, 2025 · Updated 10 months ago
OpenBMB / UltraEval
[ACL 2024 Demo] Official GitHub repo for UltraEval: An open source framework for evaluating foundation models.
☆257 · Oct 30, 2024 · Updated last year
INK-USC / ReCross
ReCross: Unsupervised Cross-Task Generalization via Retrieval Augmentation
☆23 · May 1, 2022 · Updated 4 years ago
MLGroupJLU / LLM-eval-survey
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
☆1,611 · Apr 17, 2026 · Updated 3 months ago
anchen1011 / FireAct
FireAct: Toward Language Agent Fine-tuning
☆296 · Oct 22, 2023 · Updated 2 years ago
da03 / criticize_text_generation
A method for evaluating the high-level coherence of machine-generated texts. Identifies high-level coherence issues in transformer-based …
☆12 · Mar 18, 2023 · Updated 3 years ago
krystalan / chatgpt_as_nlg_evaluator
Technical Report: Is ChatGPT a Good NLG Evaluator? A Preliminary Study
☆43 · Mar 8, 2023 · Updated 3 years ago
kevinyaobytedance / llm_eval
LLM evaluation.
☆16 · Nov 7, 2023 · Updated 2 years ago
allenai / WildBench
Benchmarking LLMs with Challenging Tasks from Real Users
☆255 · Nov 3, 2024 · Updated last year
Lordog / R-Judge
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (EMNLP Findings 2024)
☆108 · Jan 11, 2026 · Updated 6 months ago
aiwaves-cn / agents
An Open-source Framework for Data-centric, Self-evolving Autonomous Language Agents
☆5,956 · Sep 26, 2024 · Updated last year
zzma2 / medical-llm-reasoning-survey
A curated list of medical reasoning research on large language models, organized by modality, technique, application, and benchmark.
☆19 · Oct 17, 2025 · Updated 9 months ago