A Lightweight Visual Reasoning Benchmark for Evaluating Large Multimodal Models through Complex Diagrams in Coding Tasks
☆15Feb 25, 2025Updated last year
Alternatives and similar repositories for HumanEval-V-Benchmark
Users that are interested in HumanEval-V-Benchmark are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.
Sorting:
- Source code for ISSTA'24 paper "AI Coders Are Among Us: Rethinking Programming Language Grammar Towards Efficient Code Generation"☆12Oct 21, 2024Updated last year
- 基于CodeBert预训练模型,微调后/直接对目标数据集进行测试☆14Oct 19, 2021Updated 4 years ago
- EvoEval: Evolving Coding Benchmarks via LLM☆81Apr 6, 2024Updated 2 years ago
- Replication package for ISSTA2023 paper - Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond☆23Apr 9, 2023Updated 3 years ago
- [ACL'25 Findings] Official repo for "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task"☆41Apr 7, 2025Updated last year
- Managed Database hosting by DigitalOcean • AdPostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
- This repo is for our submission for ICSE 2025.☆20Jun 12, 2024Updated last year
- Analyzing Stock Movements using Markov Chains and Monte Carlo Simulation☆12Dec 25, 2016Updated 9 years ago
- We introduce FixEval , a dataset for competitive programming bug fixing along with a comprehensive test suite and show the necessity of e…☆26Aug 31, 2022Updated 3 years ago
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs☆102Oct 23, 2024Updated last year
- The code and datasets of our ACM MM 2024 paper "Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed …☆11Sep 27, 2024Updated last year
- ☆49Jul 24, 2022Updated 3 years ago
- [NeurIPS 2023] "Diversified Outlier Exposure for Out-of-Distribution Detection via Informative Extrapolation"☆11Oct 6, 2023Updated 2 years ago
- ☆10Mar 13, 2023Updated 3 years ago
- This repo illustrates how to evaluate the artifacts in the paper An Extensive Study on Pre-trained Models for Program Understanding and G…☆27Aug 12, 2022Updated 3 years ago
- Deploy on Railway without the complexity - Free Credits Offer • AdConnect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
- RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment☆17Dec 19, 2024Updated last year
- ☆12Jan 17, 2024Updated 2 years ago
- [ICLR 2025] SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models☆18Sep 17, 2025Updated 7 months ago
- Replication Package for "Compressing Pre-trained Models of Code into 3 MB", ASE 2022☆30Oct 10, 2024Updated last year
- (AAAI 2026) OSVBench, a new benchmark for evaluating Large Language Models (LLMs) in generating complete specification code pertaining to…☆13May 13, 2025Updated 11 months ago
- ☆40Apr 6, 2026Updated last month
- ☆14Aug 18, 2025Updated 8 months ago
- ☆33Sep 14, 2025Updated 7 months ago
- Detection and Classification of UI Elements of Web pages and Apps from Wireframe Sketches☆10Oct 9, 2023Updated 2 years ago
- Wordpress hosting with auto-scaling - Free Trial Offer • AdFully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
- [ICML 2023] "Unleashing Mask: Explore the Intrinsic Out-of-Distribution Detection Capability"☆18Jul 7, 2023Updated 2 years ago
- [ISSTA 2025] A Large-scale Empirical Study on Fine-tuning Large Language Models for Unit Testing☆13Feb 9, 2025Updated last year
- ☆11May 14, 2024Updated last year
- SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types☆25Nov 29, 2024Updated last year
- Model your data with the Von Mises-Fisher distribution in Python☆14Feb 1, 2016Updated 10 years ago
- This is the code repo for the paper "RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards".☆24Oct 28, 2024Updated last year
- ☆17Mar 22, 2021Updated 5 years ago
- ☆12Aug 9, 2023Updated 2 years ago
- ☆16May 25, 2022Updated 3 years ago
- Deploy on Railway without the complexity - Free Credits Offer • AdConnect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
- the paper "Geometry-aware Instance-reweighted Adversarial Training" ICLR 2021 oral☆59Apr 13, 2021Updated 5 years ago
- ☆13Jan 19, 2026Updated 3 months ago
- ☆13Nov 8, 2022Updated 3 years ago
- ☆44Dec 8, 2025Updated 5 months ago
- [ACL 2025 Findings] Implicit Reasoning in Transformers is Reasoning through Shortcuts☆18Mar 11, 2025Updated last year
- This is the tool released in ICSE 2024 paper "Domain Knowledge Matters: Improving Prompts with Fix Templates for Repairing Python Type Er…☆17Jun 5, 2023Updated 2 years ago
- Implementation for AutoIOT: LLM-Driven Automated Natural Language Programming for AIoT Applications☆38Mar 24, 2026Updated last month