chenllliang / MMEvalProLinks

[NAACL 2025] Source code for MMEvalPro, a more trustworthy and efficient benchmark for evaluating LMMs

☆24

Alternatives and similar repositories for MMEvalPro

Users that are interested in MMEvalPro are comparing it to the libraries listed below

Sorting:

kokolerk / TON
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
☆40Updated 3 weeks ago
SihengLi99 / SEALONG
Large Language Models Can Self-Improve in Long-context Reasoning
☆72Updated 8 months ago
yunfeixie233 / ViGaL
☆50Updated last month
jinzhuoran / RAG-RewardBench
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment
☆16Updated 7 months ago
DAMO-NLP-SG / LongPO
[ICLR 2025] LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization
☆38Updated 5 months ago
TingchenFu / MathIF
instruction-following benchmark for large reasoning models
☆36Updated 2 months ago
dvlab-research / Mr-Ben
This is the repo for our paper "Mr-Ben: A Comprehensive Meta-Reasoning Benchmark for Large Language Models"
☆50Updated 9 months ago
TIGER-AI-Lab / VisualWebInstruct
The official repo for "VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search"
☆26Updated 3 months ago
beichenzbc / BoostStep
official code for "BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning"
☆36Updated 6 months ago
MileBench / MileBench
This repo contains evaluation code for the paper "MileBench: Benchmarking MLLMs in Long Context"
☆36Updated last year
VisualWebBench / VisualWebBench
Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?"
☆58Updated 9 months ago
GAIR-NLP / weak-to-strong-reasoning
☆59Updated 11 months ago
GAIR-NLP / OlympicArena
[NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
☆102Updated 4 months ago
hkust-nlp / mstar
[ICML 2025] M-STAR (Multimodal Self-Evolving TrAining for Reasoning) Project. Diving into Self-Evolving Training for Multimodal Reasoning
☆63Updated 3 weeks ago
DualityRL / multi-attempt
☆19Updated 4 months ago
test-time-interaction / TTI
☆53Updated last month
EffiVLM-Bench / EffiVLM-Bench
☆19Updated 2 months ago
mathllm / Step-Controlled_DPO
☆22Updated last year
open-compass / CriticEval
[NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs
☆43Updated 8 months ago
dvlab-research / Q-LLM
This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
☆54Updated last year
TIGER-AI-Lab / AceCoder
The official repo for "AceCoder: Acing Coder RL via Automated Test-Case Synthesis" [ACL25]
☆87Updated 3 months ago
LCM-Lab / LCM_Stack
Code for paper: Long cOntext aliGnment via efficient preference Optimization
☆14Updated 5 months ago
wwxu21 / CUT
Source code of "Reasons to Reject? Aligning Language Models with Judgments"
☆58Updated last year
sail-sg / scaling-with-vocab
[NeurIPS-2024] 📈 Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies https://arxiv.org/abs/2407.13623
☆86Updated 10 months ago
yayayacc / MUR
☆40Updated last week
MiroMindAsia / MiroMind-M1
MiroMind-M1 is a fully open-source series of reasoning language models built on Qwen-2.5, focused on advancing mathematical reasoning.
☆106Updated this week
shulin16 / MMInA
[ACL2025 Findings] Benchmarking Multihop Multimodal Internet Agents
☆46Updated 5 months ago
RUCAIBox / JiuZhang3.0
The code and data for the paper JiuZhang3.0
☆48Updated last year
chtmp223 / suri
Suri: Multi-constraint instruction following for long-form text generation (EMNLP’24)
☆25Updated 8 months ago
HZQ950419 / Math-LLaVA
Code for Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
☆90Updated last year