HumanEval-V / HumanEval-V-Benchmark
A Lightweight Visual Reasoning Benchmark for Evaluating Large Multimodal Models through Complex Diagrams in Coding Tasks
☆14 · Feb 25, 2025 · Updated 11 months ago
Alternatives and similar repositories for HumanEval-V-Benchmark
Users interested in HumanEval-V-Benchmark are comparing it to the repositories listed below.
- Source code for ISSTA'24 paper "AI Coders Are Among Us: Rethinking Programming Language Grammar Towards Efficient Code Generation" · ☆12 · Oct 21, 2024 · Updated last year
- ☆11 · Mar 13, 2023 · Updated 2 years ago
- The code and datasets of our ACM MM 2024 paper "Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed … · ☆11 · Sep 27, 2024 · Updated last year
- ☆13 · Jan 22, 2025 · Updated last year
- RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment · ☆16 · Dec 19, 2024 · Updated last year
- [ACL'25 Findings] Official repo for "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task" · ☆37 · Apr 7, 2025 · Updated 10 months ago
- [ICLR 2025] SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models · ☆16 · Sep 17, 2025 · Updated 4 months ago
- Based on the pre-trained CodeBERT model; evaluated on the target dataset either directly or after fine-tuning · ☆14 · Oct 19, 2021 · Updated 4 years ago
- Source code embeddings for various programming languages · ☆16 · Jul 11, 2018 · Updated 7 years ago
- [ACL 2025 Findings] Implicit Reasoning in Transformers is Reasoning through Shortcuts · ☆17 · Mar 11, 2025 · Updated 11 months ago
- Official implementation for the paper 'Domain Adaptive Code Completion via Language Models and Decoupled Domain Databases' · ☆14 · Oct 4, 2023 · Updated 2 years ago
- SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types · ☆24 · Nov 29, 2024 · Updated last year
- EvoEval: Evolving Coding Benchmarks via LLM · ☆81 · Apr 6, 2024 · Updated last year
- Replication package for ISSTA 2023 paper "Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond" · ☆23 · Apr 9, 2023 · Updated 2 years ago
- Equivalent Linear Mappings of Large Language Models · ☆30 · Nov 7, 2025 · Updated 3 months ago
- Repository for our ICSE 2025 submission · ☆20 · Jun 12, 2024 · Updated last year
- [NeurIPS'25] ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and R… · ☆30 · Sep 27, 2025 · Updated 4 months ago
- [NAACL 2025] Source code for MMEvalPro, a more trustworthy and efficient benchmark for evaluating LMMs · ☆24 · Sep 26, 2024 · Updated last year
- ☆25 · Aug 2, 2025 · Updated 6 months ago
- Implementation for AutoIOT: LLM-Driven Automated Natural Language Programming for AIoT Applications · ☆32 · Apr 21, 2025 · Updated 9 months ago
- Official code and data repository of "MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Inte… · ☆22 · Jun 3, 2024 · Updated last year
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs · ☆101 · Oct 23, 2024 · Updated last year
- Reproducible Language Agent Research · ☆33 · Jun 25, 2025 · Updated 7 months ago
- Modality Gap–Driven Subspace Alignment Training Paradigm for Multimodal Large Language Models · ☆48 · Updated this week
- [NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration · ☆26 · Oct 17, 2024 · Updated last year
- We introduce FixEval, a dataset for competitive programming bug fixing, along with a comprehensive test suite, and show the necessity of e… · ☆26 · Aug 31, 2022 · Updated 3 years ago