MathEval is a benchmark dedicated to the holistic evaluation of the mathematical capabilities of LLMs.
☆86 Nov 15, 2024 Updated last year
Alternatives and similar repositories for MathEval
Users who are interested in MathEval are comparing it to the libraries listed below
- ☆158 Sep 15, 2023 Updated 2 years ago
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨ ☆273 Apr 26, 2024 Updated last year
- AAAI2024 Global Competition on Math Problem Solving and Reasoning ☆14 Oct 4, 2023 Updated 2 years ago
- The official repository of the Omni-MATH benchmark. ☆93 Dec 22, 2024 Updated last year
- ☆30 Dec 27, 2024 Updated last year
- [AAAI 2026] SIFThinker: Spatially-Aware Image Focus for Visual Reasoning ☆23 Dec 2, 2025 Updated 2 months ago
- Source code for ACL 2021 paper "Automatic ICD Coding via Interactive Shared Representation Networks with Self-distillation Mechanism" ☆14 Jun 1, 2021 Updated 4 years ago
- Code for the ACL2022 paper "Synthetic Question Value Estimation for Domain Adaptation of Question Answering" ☆17 Mar 21, 2022 Updated 3 years ago
- [NAACL 2025] Representing Rule-based Chatbots with Transformers ☆23 Feb 9, 2025 Updated last year
- Mix of Minimal Optimal Sets (MMOS) of dataset has two advantages for two aspects, higher performance and lower construction costs on math… ☆74 Jul 27, 2024 Updated last year
- The dataset and code for paper: TheoremQA: A Theorem-driven Question Answering dataset ☆160 Apr 23, 2024 Updated last year
- [AAAI 2025] Augmenting Math Word Problems via Iterative Question Composing (https://arxiv.org/abs/2401.09003) ☆23 Oct 2, 2025 Updated 4 months ago
- LLM evaluation. ☆16 Nov 7, 2023 Updated 2 years ago
- PULSE-EVAL ☆24 Jan 12, 2024 Updated 2 years ago
- NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks ☆20 May 10, 2022 Updated 3 years ago
- [ACL 2024 Findings] The official repo for "ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large … ☆24 May 29, 2024 Updated last year
- Lightweight tool to identify Data Contamination in LLMs evaluation ☆53 Mar 8, 2024 Updated last year
- ☆84 Apr 18, 2024 Updated last year
- Logiqa2.0 dataset - logical reasoning in MRC and NLI tasks ☆102 Aug 11, 2023 Updated 2 years ago
- [NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI ☆107 Mar 6, 2025 Updated 11 months ago
- ☆25 Aug 23, 2024 Updated last year
- ☆27 Jan 23, 2024 Updated 2 years ago
- ☆26 Nov 1, 2021 Updated 4 years ago
- 🤖ConvRe🤯: An Investigation of LLMs’ Inefficacy in Understanding Converse Relations (EMNLP 2023) ☆24 Oct 10, 2023 Updated 2 years ago
- [ACL 2024 Findings] CriticBench: Benchmarking LLMs for Critique-Correct Reasoning ☆30 Mar 5, 2024 Updated last year
- ROUGE for multilingual Summarization ☆25 Oct 11, 2021 Updated 4 years ago
- ☆31 Jun 12, 2024 Updated last year
- PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion ☆59 Feb 29, 2024 Updated last year
- Source codes and datasets for How well do Large Language Models perform in Arithmetic tasks? ☆57 Apr 17, 2023 Updated 2 years ago
- GSM-Plus: Data, Code, and Evaluation for Enhancing Robust Mathematical Reasoning in Math Word Problems. ☆64 Jul 8, 2024 Updated last year
- A Massive Multi-Level Multi-Subject Knowledge Evaluation benchmark ☆104 Jul 20, 2023 Updated 2 years ago
- An Experiment on Dynamic NTK Scaling RoPE ☆64 Nov 26, 2023 Updated 2 years ago
- Graph4Tree is a simple example code for our EMNLP'20 Findings paper idea. ☆26 Nov 18, 2020 Updated 5 years ago
- [ACL 2024 Demo] Official GitHub repo for UltraEval: An open source framework for evaluating foundation models. ☆256 Oct 30, 2024 Updated last year
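- Chinese-language prompts for Sora | Short-video prompt techniques | Tuning guide. Usage guides for a variety of scenarios. Learn how to make it do what you ask. Covers Sora's multi-scenario applications. ☆122 Updated this week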
- [ICML'24] TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks ☆32 Sep 20, 2024 Updated last year
- [AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy ☆77 Oct 9, 2025 Updated 4 months ago
- Oak National Academy's AI Auto Eval tools provide LLM as a judge evaluation on lesson plans and resources ☆17 Nov 4, 2025 Updated 3 months ago
- Code and data for "MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning" [ICLR 2024] ☆383 Aug 25, 2024 Updated last year