HumanEval-V / HumanEval-V-Benchmark
A Lightweight Visual Reasoning Benchmark for Evaluating Large Multimodal Models through Complex Diagrams in Coding Tasks
☆12 Updated 4 months ago
Alternatives and similar repositories for HumanEval-V-Benchmark
Users who are interested in HumanEval-V-Benchmark are comparing it to the repositories listed below
- A novel approach to improving the safety of large language models, enabling them to transition effectively from an unsafe to a safe state. ☆61 Updated last month
- More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models ☆21 Updated 3 weeks ago
- [ECCV 2024] Official PyTorch Implementation of "How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs" ☆81 Updated last year
- ☆119 Updated last month
- [EMNLP 2024] Multi-modal reasoning problems via code generation. ☆23 Updated 4 months ago
- [ICML 2024 Oral] Official code repository for MLLM-as-a-Judge. ☆70 Updated 4 months ago
- ☆19 Updated 8 months ago
- ☆19 Updated last month
- Code repo for the paper: Attacking Vision-Language Computer Agents via Pop-ups ☆33 Updated 6 months ago
- A Survey on the Honesty of Large Language Models ☆57 Updated 6 months ago
- [ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning ☆94 Updated last year
- ☆29 Updated last week
- Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?" ☆44 Updated 3 weeks ago
- ☆30 Updated last year
- [ICLR 2025] Official codebase for the paper "Multimodal Situational Safety" ☆18 Updated this week
- ☆33 Updated 8 months ago
- ☆46 Updated 7 months ago
- Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025) ☆55 Updated 3 months ago
- Official codebase for "STAIR: Improving Safety Alignment with Introspective Reasoning" ☆42 Updated 4 months ago
- ☆46 Updated 2 months ago
- [ICML 2025] M-STAR (Multimodal Self-Evolving TrAining for Reasoning) Project. Diving into Self-Evolving Training for Multimodal Reasoning ☆60 Updated 6 months ago
- Reinforcement learning code for the SPA-VL dataset ☆34 Updated last year
- The official repository for the paper "MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance" ☆37 Updated last year
- [ACL'2025 Findings] Official repo for "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task… ☆28 Updated 2 months ago
- ☆23 Updated 3 months ago
- ☆31 Updated this week
- This repository contains the code and data for the paper "SelfIE: Self-Interpretation of Large Language Model Embeddings" by Haozhe Chen,… ☆50 Updated 6 months ago
- ☆41 Updated 8 months ago
- CoT-Valve: Length-Compressible Chain-of-Thought Tuning ☆73 Updated 4 months ago
- ☆19 Updated 4 months ago