HumanEval-V / HumanEval-V-BenchmarkLinks

A Lightweight Visual Reasoning Benchmark for Evaluating Large Multimodal Models through Complex Diagrams in Coding Tasks

☆12

Alternatives and similar repositories for HumanEval-V-Benchmark

Users that are interested in HumanEval-V-Benchmark are comparing it to the libraries listed below

Sorting:

TIGER-AI-Lab / AceCoder
The official repo for "AceCoder: Acing Coder RL via Automated Test-Case Synthesis" [ACL25]
☆86Updated 4 months ago
SihengLi99 / LLM-Honesty-Survey
[2025-TMLR] A Survey on the Honesty of Large Language Models
☆58Updated 8 months ago
UCSB-NLP-Chang / ThinkPrune
☆39Updated 3 months ago
hkust-nlp / mstar
[ICML 2025] M-STAR (Multimodal Self-Evolving TrAining for Reasoning) Project. Diving into Self-Evolving Training for Multimodal Reasoning
☆64Updated 3 weeks ago
ZhentingWang / DUMP
☆23Updated 3 months ago
GeniusHTX / TALE
☆126Updated 2 months ago
hkust-nlp / RL-Verifier-Pitfalls
Pitfalls of Rule- and Model-based Verifiers: A Case Study on Mathematical Reasoning.
☆22Updated 2 months ago
sail-sg / ActivePRM
☆17Updated 3 months ago
test-time-interaction / TTI
☆53Updated 2 months ago
GAIR-NLP / ReasonEval
[AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy
☆65Updated 7 months ago
JayZhang42 / SLED
SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Model https://arxiv.org/pdf/2411.02433
☆28Updated 8 months ago
MJ-Bench / MJ-Bench
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
☆46Updated 2 months ago
Hambaobao / SWE-Flow
SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner
☆23Updated last month
HKUNLP / critic-rl
[ICML 2025] Teaching Language Models to Critique via Reinforcement Learning
☆108Updated 3 months ago
hkust-nlp / Laser
Laser: Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
☆52Updated 2 months ago
likaixin2000 / MMCode
[EMNLP 2024] Multi-modal reasoning problems via code generation.
☆23Updated 6 months ago
KbsdJames / omni-math-rule
The rule-based evaluation subset and code implementation of Omni-MATH
☆22Updated 7 months ago
hkust-nlp / Activation_Decoding
In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024)
☆61Updated last year
Gen-Verse / CURE
Open-Source LLM Coders with Co-Evolving Reinforcement Learning
☆103Updated 2 weeks ago
inclusionAI / PromptCoT
A unified suite for generating elite reasoning problems and training high-performance LLMs, including pioneering attention-free architect…
☆64Updated 2 months ago
ssmisya / PRMBench
[ACL' 25] The official code repository for PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models.
☆78Updated 5 months ago
GAIR-NLP / weak-to-strong-reasoning
☆59Updated 11 months ago
SIMONLQY / RethinkMCTS
☆28Updated 10 months ago
horseee / CoT-Valve
CoT-Valve: Length-Compressible Chain-of-Thought Tuning
☆82Updated 5 months ago
THU-KEG / RM-Bench
[ICLR 25 Oral] RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
☆58Updated 3 weeks ago
CodeEval-Pro / CodeEval-Pro
[ACL'2025 Findings] Official repo for "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task…
☆28Updated 4 months ago
RUCAIBox / RLMEC
The official repository of "Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint"
☆38Updated last year
ShadeCloak / ADORA
☆46Updated 4 months ago
KbsdJames / Omni-MATH
The official repository of the Omni-MATH benchmark.
☆87Updated 7 months ago
SafeAILab / RAIN
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
☆96Updated last year