A Lightweight Visual Reasoning Benchmark for Evaluating Large Multimodal Models through Complex Diagrams in Coding Tasks
☆15 · Feb 25, 2025 · Updated last year
Alternatives and similar repositories for HumanEval-V-Benchmark
Users who are interested in HumanEval-V-Benchmark are comparing it to the libraries listed below.
- Testing on target datasets with the pre-trained CodeBert model, either after fine-tuning or directly ☆14 · Oct 19, 2021 · Updated 4 years ago
- EvoEval: Evolving Coding Benchmarks via LLM ☆83 · Apr 6, 2024 · Updated 2 years ago
- Replication package for ISSTA2023 paper - Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond ☆23 · Apr 9, 2023 · Updated 3 years ago
- [ACL'25 Findings] Official repo for "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task" ☆41 · Apr 7, 2025 · Updated last year
- This repo is for our submission for ICSE 2025. ☆20 · Jun 12, 2024 · Updated last year
- This is the official implementation for the paper "Domain Adaptive Code Completion via Language Models and Decoupled Domain Databases" ☆14 · Oct 4, 2023 · Updated 2 years ago
- ☆14 · Jan 22, 2025 · Updated last year
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs ☆103 · Oct 23, 2024 · Updated last year
- The code and datasets of our ACM MM 2024 paper "Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed … ☆11 · Sep 27, 2024 · Updated last year
- [NeurIPS 2023] "Diversified Outlier Exposure for Out-of-Distribution Detection via Informative Extrapolation" ☆11 · Oct 6, 2023 · Updated 2 years ago
- UICrit is a dataset containing human-generated natural language design critiques, corresponding bounding boxes for each critique, and des… ☆25 · Nov 19, 2024 · Updated last year
- RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment ☆17 · Dec 19, 2024 · Updated last year
- PyTorch usage tips and tutorials ☆12 · Apr 17, 2023 · Updated 3 years ago
- [EACL 2024] ICE-Score: Instructing Large Language Models to Evaluate Code ☆79 · Jun 16, 2024 · Updated last year
- Replication Package for "Compressing Pre-trained Models of Code into 3 MB", ASE 2022 ☆30 · Oct 10, 2024 · Updated last year
- [NeurIPS 2024] TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration ☆25 · Oct 17, 2024 · Updated last year
- Exercises for the Dafny Tutorial ☆14 · May 21, 2018 · Updated 8 years ago
- (AAAI 2026) OSVBench, a new benchmark for evaluating Large Language Models (LLMs) in generating complete specification code pertaining to… ☆13 · May 13, 2025 · Updated last year
- ☆40 · Apr 6, 2026 · Updated last month
- ☆34 · Sep 14, 2025 · Updated 8 months ago
- ☆10 · Apr 15, 2023 · Updated 3 years ago
- Detection and Classification of UI Elements of Web pages and Apps from Wireframe Sketches ☆10 · Oct 9, 2023 · Updated 2 years ago
- [ICML 2023] "Unleashing Mask: Explore the Intrinsic Out-of-Distribution Detection Capability" ☆18 · Jul 7, 2023 · Updated 2 years ago
- [ISSTA 2025] A Large-scale Empirical Study on Fine-tuning Large Language Models for Unit Testing ☆13 · Feb 9, 2025 · Updated last year
- ☆10 · May 14, 2024 · Updated 2 years ago
- [ICLR 2026] Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing ☆28 · May 11, 2026 · Updated 2 weeks ago
- SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types ☆25 · Nov 29, 2024 · Updated last year
- This is the code repo for the paper "RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards". ☆24 · Oct 28, 2024 · Updated last year
- ☆12 · Aug 9, 2023 · Updated 2 years ago
- ☆13 · Jan 19, 2026 · Updated 4 months ago
- ☆13 · Nov 8, 2022 · Updated 3 years ago
- [ACL 2025 Findings] Implicit Reasoning in Transformers is Reasoning through Shortcuts ☆18 · Mar 11, 2025 · Updated last year
- Collected the world's best computer vision labs and lecture materials. ☆14 · Feb 23, 2025 · Updated last year
- ☆16 · Nov 24, 2023 · Updated 2 years ago
- ☆16 · Aug 26, 2023 · Updated 2 years ago
- ☆12 · Jun 8, 2017 · Updated 8 years ago
- Official PyTorch implementation for paper: Energy-Based Sliced Wasserstein Distance ☆18 · Feb 21, 2025 · Updated last year
- [ICLR 2025] Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision ☆72 · Jul 10, 2024 · Updated last year
- Web site for standardml.org. ☆37 · Oct 17, 2023 · Updated 2 years ago