HumanEval-V / HumanEval-V-Benchmark
A Lightweight Visual Reasoning Benchmark for Evaluating Large Multimodal Models through Complex Diagrams in Coding Tasks
☆12 Updated 7 months ago
Alternatives and similar repositories for HumanEval-V-Benchmark
Users who are interested in HumanEval-V-Benchmark are comparing it to the libraries listed below.
- [ICML 2025] M-STAR (Multimodal Self-Evolving TrAining for Reasoning) Project. Diving into Self-Evolving Training for Multimodal Reasoning ☆69 Updated 2 months ago
- ☆28 Updated 4 months ago
- The official repo for "AceCoder: Acing Coder RL via Automated Test-Case Synthesis" [ACL25] ☆88 Updated 5 months ago
- Pitfalls of Rule- and Model-based Verifiers: A Case Study on Mathematical Reasoning. ☆23 Updated 3 months ago
- ☆18 Updated 5 months ago
- The rule-based evaluation subset and code implementation of Omni-MATH ☆23 Updated 9 months ago
- ☆131 Updated 2 weeks ago
- [2025-TMLR] A Survey on the Honesty of Large Language Models ☆59 Updated 9 months ago
- [EMNLP 2024] Multi-modal reasoning problems via code generation. ☆25 Updated 7 months ago
- [ICLR 25 Oral] RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style ☆62 Updated 2 months ago
- A Sober Look at Language Model Reasoning ☆83 Updated 2 weeks ago
- ☆113 Updated last month
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning ☆97 Updated last year
- ☆43 Updated 5 months ago
- ☆50 Updated 11 months ago
- [ICML 2025] Teaching Language Models to Critique via Reinforcement Learning ☆111 Updated 4 months ago
- A unified suite for generating elite reasoning problems and training high-performance LLMs, including pioneering attention-free architect… ☆67 Updated 3 months ago
- [NeurIPS 2025 Spotlight] ReasonFlux-Coder: Open-Source LLM Coders with Co-Evolving Reinforcement Learning ☆117 Updated last week
- In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024) ☆61 Updated last year
- Optimizing Anytime Reasoning via Budget Relative Policy Optimization ☆47 Updated 2 months ago
- ☆61 Updated 3 months ago
- Model merging is a highly efficient approach for long-to-short reasoning. ☆82 Updated 3 months ago
- ☆39 Updated last year
- Laser: Learn to Reason Efficiently with Adaptive Length-based Reward Shaping ☆54 Updated 4 months ago
- InstructCoder: Instruction Tuning Large Language Models for Code Editing | Oral ACL-2024 srw ☆62 Updated 11 months ago
- Instruction-following benchmark for large reasoning models ☆42 Updated last month
- ☆21 Updated 4 months ago
- [ACL 2024 Findings] CriticBench: Benchmarking LLMs for Critique-Correct Reasoning ☆27 Updated last year
- ☆32 Updated 2 weeks ago
- [AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy ☆69 Updated 9 months ago