open-compass/CriticEval

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/open-compass/CriticEval)

open-compass / CriticEval

[NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs

☆49

Alternatives and similar repositories for CriticEval

Users that are interested in CriticEval are comparing it to the libraries listed below

Sorting:

open-compass / MathBench
View on GitHub
[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
☆111May 22, 2025Updated 9 months ago
usail-hkust / benchmark_inference_time_computation_LLM
View on GitHub
[NeurIPS 2025] Bag of Tricks for Inference-time Computation of LLM Reasoning
☆16Sep 20, 2025Updated 5 months ago
LuoXiaoHeics / Continual-Tune
View on GitHub
☆10Feb 6, 2025Updated last year
zhudotexe / fanoutqa
View on GitHub
Companion code for FanOutQA: Multi-Hop, Multi-Document Question Answering for Large Language Models (ACL 2024)
☆59Sep 22, 2025Updated 5 months ago
infi-coder / infibench-evaluation-harness
View on GitHub
The Infibench variant of bigcode-evaluation-harness --- a framework for the evaluation of autoregressive code generation language models.
☆14Oct 19, 2024Updated last year
gmftbyGMFTBY / Study
View on GitHub
Good good study, day day ugly
☆10Dec 12, 2018Updated 7 years ago
GAIR-NLP / MetaCritique
View on GitHub
Evaluate the Quality of Critique
☆36Jun 1, 2024Updated last year
GAIR-NLP / auto-j
View on GitHub
Generative Judge for Evaluating Alignment
☆250Jan 18, 2024Updated 2 years ago
peng-weihan / SWE-QA-Bench
View on GitHub
☆45Jan 21, 2026Updated last month
LightChen233 / reasoning-boundary
View on GitHub
☆70Jun 18, 2025Updated 8 months ago
princeton-nlp / ELIZA-Transformer
View on GitHub
[NAACL 2025] Representing Rule-based Chatbots with Transformers
☆23Feb 9, 2025Updated last year
DAMO-NLP-SG / Auto-Arena-LLMs
View on GitHub
☆43Oct 7, 2024Updated last year
allenai / numglue
View on GitHub
NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks
☆20May 10, 2022Updated 3 years ago
gouki510 / Topology_of_Reasoning
View on GitHub
☆40Jun 11, 2025Updated 8 months ago
TIGER-AI-Lab / AceCoder
View on GitHub
The official repo for "AceCoder: Acing Coder RL via Automated Test-Case Synthesis" [ACL25]
☆97Apr 9, 2025Updated 10 months ago
CGCL-codes / GraphInstruct
View on GitHub
The benchmark proposed in paper: GraphInstruct: Empowering Large Language Models with Graph Understanding and Reasoning Capability
☆23Aug 12, 2025Updated 6 months ago
google-research-datasets / wikifact
View on GitHub
Wikipedia based dataset to train relationship classifiers and fact extraction models
☆26May 25, 2021Updated 4 years ago
naver-ai / ALMoST
View on GitHub
☆24Dec 2, 2023Updated 2 years ago
THUDM / ChatGLM-Math
View on GitHub
☆83Apr 18, 2024Updated last year
3B-Group / ConvRe
View on GitHub
🤖ConvRe🤯: An Investigation of LLMs’ Inefficacy in Understanding Converse Relations (EMNLP 2023)
☆24Oct 10, 2023Updated 2 years ago
zhaoxlpku / SubgoalXL
View on GitHub
☆25Aug 23, 2024Updated last year
comprehensiveMap / MDFN
View on GitHub
The code for Mask-based Decoupling-Fusing Network
☆23Dec 14, 2020Updated 5 years ago
open-compass / ANAH
View on GitHub
[ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2 & [ICLR 2025] Mask-DPO
☆62Apr 30, 2025Updated 10 months ago
open-compass / T-Eval
View on GitHub
[ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
☆304Apr 3, 2024Updated last year
thu-coai / ComplexBench
View on GitHub
Benchmarking Complex Instruction-Following with Multiple Constraints Composition (NeurIPS 2024 Datasets and Benchmarks Track)
☆102Feb 20, 2025Updated last year
jaehunjung1 / cascaded-selective-evaluation
View on GitHub
☆29Feb 24, 2025Updated last year
NExT-GPT / NExT-GPT.github.io
View on GitHub
NExT-GPT: Any-to-Any Multimodal Large Language Model
☆20Nov 3, 2024Updated last year
open-compass / GTA
View on GitHub
[NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
☆135Feb 16, 2026Updated 2 weeks ago
blcuicall / mcts
View on GitHub
Code and data of the paper "MCTS: A Multi-Reference Chinese Text Simplification Dataset".
☆33Jun 3, 2024Updated last year
open-compass / BotChat
View on GitHub
Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
☆161May 22, 2025Updated 9 months ago
QwenLM / Qwen-Cookbook
View on GitHub
Open-source examples and guides for building with the Qwen. Browse a collection of snippets, advanced techniques and walkthroughs.
☆37Nov 20, 2024Updated last year
kyegomez / Reka-Torch
View on GitHub
Implementation of the model: "Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models" in PyTorch
☆28Feb 9, 2026Updated 3 weeks ago
QwenLM / ProcessBench
View on GitHub
Official repository for ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning"
☆184May 20, 2025Updated 9 months ago
ZHZisZZ / weak-to-strong-search
View on GitHub
[NeurIPS'24] Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models
☆66Dec 10, 2024Updated last year
zankner / CLoud
View on GitHub
Critique-out-Loud Reward Models
☆74Oct 18, 2024Updated last year
llmeval / LLMEval-1
View on GitHub
中文大语言模型评测第一期
☆113Oct 23, 2023Updated 2 years ago
orhonovich / q-squared
View on GitHub
☆30Sep 5, 2021Updated 4 years ago
Skytliang / COT-Reading-List
View on GitHub
☆27Mar 6, 2023Updated 2 years ago
mtbench101 / mt-bench-101
View on GitHub
[ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
☆142Jul 24, 2024Updated last year