open-compass/MathBench

[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset

☆115

Alternatives and similar repositories for MathBench

Users interested in MathBench are comparing it to the repositories listed below.

open-compass / Ada-LEval
The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
☆56 · May 22, 2025 · Updated last year
SparksJoe / Prism
A Framework for Decoupling and Assessing the Capabilities of VLMs
☆44 · Jun 28, 2024 · Updated 2 years ago
open-compass / ProSA
[EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
☆29 · May 22, 2025 · Updated last year
KbsdJames / Omni-MATH
The official repository of the Omni-MATH benchmark.
☆94 · Dec 22, 2024 · Updated last year
ChengpengLi1003 / DotaMath
☆30 · Dec 27, 2024 · Updated last year
open-compass / Creation-MMBench
Assessing Context-Aware Creative Intelligence in MLLMs
☆23 · Jul 22, 2025 · Updated 11 months ago
open-compass / CriticEval
[NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs
☆49 · Nov 29, 2024 · Updated last year
GAIR-NLP / self-improvement-reversal
☆13 · Jul 14, 2024 · Updated 2 years ago
GAIR-NLP / OlympicArena
[NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
☆106 · Mar 6, 2025 · Updated last year
KbsdJames / MATH-Minos
The implementation of paper "LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Fee…
☆38 · Jul 25, 2024 · Updated last year
hkust-nlp / llm-compression-intelligence
Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024]
☆150 · Sep 20, 2024 · Updated last year
GAIR-NLP / ReasonEval
[AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy
☆80 · Oct 9, 2025 · Updated 9 months ago
open-compass / CompassJudger
The All-in-one Judge Models introduced by Opencompass
☆119 · Jul 15, 2025 · Updated last year
OpenBMB / OlympiadBench
[ACL 2024] Official GitHub repo for OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scie…
☆195 · Jun 8, 2025 · Updated last year
open-compass / BotChat
Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
☆163 · May 22, 2025 · Updated last year
GAIR-NLP / benbench
Benchmarking Benchmark Leakage in Large Language Models
☆61 · May 20, 2024 · Updated 2 years ago
open-compass / GPassK
[ACL 2025] Are Your LLMs Capable of Stable Reasoning?
☆33 · Aug 5, 2025 · Updated 11 months ago
tongyx361 / Awesome-LLM4Math
Curation of resources for LLM mathematical reasoning, most of which are screened by @tongyx361 to ensure high quality and accompanied wit…
☆159 · Jul 12, 2024 · Updated 2 years ago
RUCAIBox / JiuZhang3.0
The code and data for the paper JiuZhang3.0
☆49 · May 26, 2024 · Updated 2 years ago
iiis-ai / IterativeQuestionComposing
[AAAI 2025] Augmenting Math Word Problems via Iterative Question Composing (https://arxiv.org/abs/2401.09003)
☆23 · Oct 2, 2025 · Updated 9 months ago
conceptmath / conceptmath
[ACL 2024 Findings] The official repo for "ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large …
☆26 · May 29, 2024 · Updated 2 years ago
fzyzcjy / ai_math_paper_list
AI for Mathematics Paper List
☆17 · Jan 14, 2025 · Updated last year
THUDM / ChatGLM-Math
☆82 · Apr 18, 2024 · Updated 2 years ago
ZubinGou / math-evaluation-harness
A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨
☆277 · Apr 26, 2024 · Updated 2 years ago
chaochun / nlu-asdiv-dataset
☆52 · Jul 4, 2023 · Updated 3 years ago
StigLidu / TURN
[ICML2025] Official Repo for Paper "Optimizing Temperature for Language Models with Multi-Sample Inference"
☆23 · Feb 16, 2025 · Updated last year
hkust-nlp / dart-math
[NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*
☆120 · Dec 10, 2024 · Updated last year
JiwooKimAR / dmath
☆12 · Feb 16, 2024 · Updated 2 years ago
open-compass / T-Eval
[ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
☆312 · Apr 3, 2024 · Updated 2 years ago
allenai / unifew
Unifew: Unified Fewshot Learning Model
☆18 · Sep 10, 2021 · Updated 4 years ago
THUDM / AlignBench
A multi-dimensional Chinese alignment evaluation benchmark for large language models (ACL 2024)
☆430 · Oct 25, 2025 · Updated 8 months ago
InternLM / Agent-FLAN
[ACL2024 Findings] Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models
☆361 · Mar 22, 2024 · Updated 2 years ago
math-eval / MathEval
MathEval is a benchmark dedicated to the holistic evaluation on mathematical capacities of LLMs.
☆87 · Nov 15, 2024 · Updated last year
PremiLab-Math / MathCheck
[ICLR 2025] Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist
☆34 · Oct 23, 2024 · Updated last year
megvii-research / basedet
An object detection codebase based on MegEngine.
☆28 · Dec 14, 2022 · Updated 3 years ago
MARIO-Math-Reasoning / MARIO_EVAL
☆52 · Mar 5, 2025 · Updated last year
koalazf99 / nanoverl
Collections of RLxLM experiments using minimal codes
☆14 · Feb 17, 2025 · Updated last year
open-compass / code-evaluator
A multi-language code evaluation tool.
☆28 · Jan 26, 2024 · Updated 2 years ago
TIGER-AI-Lab / MAmmoTH2
Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024]
☆146 · Oct 27, 2024 · Updated last year