HumanEval-V / HumanEval-V-Benchmark
A Lightweight Visual Understanding and Reasoning Benchmark for Evaluating Large Multimodal Models through Coding Tasks
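HumanEval-style coding benchmarks typically score models by executing generated code against unit tests and reporting pass@k. Whether HumanEval-V uses exactly this metric is an assumption here; the sketch below shows the standard unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples, drawn without replacement from n generations of which
    c pass the tests, is correct. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset
        # must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per task, 3 pass -> pass@1 = 0.3
print(pass_at_k(10, 3, 1))
```

The estimator is averaged over all tasks in the benchmark to produce the headline score.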
Alternatives and similar repositories for HumanEval-V-Benchmark:
Users interested in HumanEval-V-Benchmark are comparing it to the repositories listed below.
- SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Models (https://arxiv.org/pdf/2411.02433)
- M-STAR (Multimodal Self-Evolving TrAining for Reasoning): Diving into Self-Evolving Training for Multimodal Reasoning
- [NeurIPS'24] Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models
- Source code for MMEvalPro, a more trustworthy and efficient benchmark for evaluating LMMs
- [ICML 2024] Self-Infilling Code Generation
- Repository for the project "Fine-tuning Large Language Models with Sequential Instructions"; the code base comes from open-instruct and LA…
- The rule-based evaluation subset and code implementation of Omni-MATH
- Training and Benchmarking LLMs for Code Preference
- [AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy
- A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
- The repo for the paper "Mr-Ben: A Comprehensive Meta-Reasoning Benchmark for Large Language Models"
- A Survey on the Honesty of Large Language Models
- Official repository for MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [NeurIPS 2024]
- The official code repository for PRMBench
- Watch Every Step! LLM Agent Learning via Iterative Step-level Process Refinement (EMNLP 2024 Main Conference)
- The official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
- XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts
- A paper list on data contamination in large language model evaluation
- [ACL 2024] The project of Symbol-LLM
- UniGen: A Unified Framework for Dataset Generation via Large Language Model
- In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024)
- This repo contains evaluation code for the paper "MileBench: Benchmarking MLLMs in Long Context"
- A novel approach to improving the safety of large language models, enabling them to transition effectively from an unsafe to a safe state