ZubinGou / math-evaluation-harnessLinks

A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨

☆258

Alternatives and similar repositories for math-evaluation-harness

Users that are interested in math-evaluation-harness are comparing it to the libraries listed below

Sorting:

GAIR-NLP / LIMR
☆211Updated 8 months ago
tongyx361 / Awesome-LLM4Math
Curation of resources for LLM mathematical reasoning, most of which are screened by @tongyx361 to ensure high quality and accompanied wit…
☆143Updated last year
QwenLM / ProcessBench
Official repository for ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning"
☆174Updated 5 months ago
hkust-nlp / dart-math
[NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*
☆115Updated 10 months ago
eddycmu / demystify-long-cot
☆323Updated 4 months ago
kanishkg / cognitive-behaviors
☆210Updated 6 months ago
Zanette-Labs / efficient-reasoning
☆67Updated 6 months ago
cmu-l3 / l1
L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
☆258Updated 5 months ago
OFA-Sys / gsm8k-ScRel
Codes and Data for Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
☆267Updated last year
nightdessert / Retrieval_Head
open-source code for paper: Retrieval Head Mechanistically Explains Long-Context Factuality
☆215Updated last year
MARIO-Math-Reasoning / Super_MARIO
☆342Updated 4 months ago
hahahawu / Long-to-Short-via-Model-Merging
Model merging is a highly efficient approach for long-to-short reasoning.
☆89Updated last week
TIGER-AI-Lab / verl-tool
A version of verl to support diverse tool use
☆607Updated this week
YuxiXie / MCTS-DPO
This is the repository that contains the source code for the Self-Evaluation Guided MCTS for online DPO.
☆327Updated last year
sail-sg / oat-zero
A lightweight reproduction of DeepSeek-R1-Zero with indepth analysis of self-reflection behavior.
☆247Updated 6 months ago
Hongcheng-Gao / Awesome-Long2short-on-LRMs
Awesome-Long2short-on-LRMs is a collection of state-of-the-art, novel, exciting long2short methods on large reasoning models. It contains…
☆252Updated 2 months ago
princeton-nlp / ProLong
Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)"
☆231Updated last month
PRIME-RL / ImplicitPRM
Repo of paper "Free Process Rewards without Process Labels"
☆164Updated 7 months ago
hemingkx / TokenSkip
[EMNLP 2025] TokenSkip: Controllable Chain-of-Thought Compression in LLMs
☆186Updated 3 months ago
princeton-nlp / LESS
[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning
☆496Updated last year
GAIR-NLP / ToRL
☆300Updated 4 months ago
LCLM-Horizon / A-Comprehensive-Survey-For-Long-Context-Language-Modeling
A Comprehensive Survey on Long Context Language Modeling
☆193Updated 3 months ago
IAAR-Shanghai / xVerify
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
☆135Updated 6 months ago
EIT-NLP / Awesome-Latent-CoT
This repository contains a regularly updated paper list for LLMs-reasoning-in-latent-space.
☆177Updated this week
CJReinforce / PURE
Official code for the paper, "Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning"
☆138Updated 3 months ago
getao / icae
The repo for In-context Autoencoder
☆145Updated last year
OpenBMB / OlympiadBench
[ACL 2024]Official GitHub repo for OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scie…
☆171Updated 4 months ago
PRIME-RL / Entropy-Mechanism-of-RL
The Entropy Mechanism of Reinforcement Learning for Large Language Model Reasoning.
☆350Updated 3 months ago
CMU-AIRe / MRT
Research Code for preprint "Optimizing Test-Time Compute via Meta Reinforcement Finetuning".
☆112Updated 2 months ago
Blueyee / Efficient-CoT-LRMs
Chain of Thoughts (CoT) is so hot! so long! We need short reasoning process!
☆69Updated 6 months ago