open-compass / CriticBench
A comprehensive benchmark for evaluating critique ability of LLMs
Related projects:
- This repo contains evaluation code for the paper "MileBench: Benchmarking MLLMs in Long Context"
- This is the repo for our paper "Mr-Ben: A Comprehensive Meta-Reasoning Benchmark for Large Language Models"
- This is the implementation of LeCo
- Source code of "Reasons to Reject? Aligning Language Models with Judgments"
- Evaluating Mathematical Reasoning Beyond Accuracy
- Official repository for the paper "Weak-to-Strong Extrapolation Expedites Alignment"
- Paper list and datasets for the paper "A Survey on Data Selection for LLM Instruction Tuning"
- This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
- [ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
- GPT as Human
- Feeling confused about superalignment? Here is a reading list
- Official implementation of the paper "From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large L…"
- Code for "Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models"
- [ICML 2024] Can AI Assistants Know What They Don't Know?
- Watch Every Step! LLM Agent Learning via Iterative Step-level Process Refinement
- Source code for MMEvalPro, a more trustworthy and efficient benchmark for evaluating LMMs
- [ICLR'24 Spotlight] "Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts"
- [ICLR'24 Spotlight] Tool-Augmented Reward Modeling
- ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
- [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
- [ICML 2024] Selecting High-Quality Data for Training Language Models
- [arXiv:2406.17419] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
- The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
- [ICLR 2024] CLEX: Continuous Length Extrapolation for Large Language Models
- The source code of "Merging Experts into One: Improving Computational Efficiency of Mixture of Experts" (EMNLP 2023)