sarahmart / HARDMath
A new dataset of difficult graduate-level applied mathematics problems; evaluations demonstrate that leading LLMs currently exhibit low accuracy in solving these problems.
☆13 Updated last month
Alternatives and similar repositories for HARDMath:
Users who are interested in HARDMath are comparing it to the repositories listed below.
- LINC: Logical Inference via Neurosymbolic Computation [EMNLP2023] ☆64 Updated last year
- NaturalProver: Grounded Mathematical Proof Generation with Language Models ☆36 Updated 2 years ago
- ☆24 Updated 7 months ago
- Collections of RLxLM experiments using minimal codes ☆12 Updated last month
- The official repository for the paper Multilingual Mathematical Autoformalization ☆34 Updated 10 months ago
- Synthetic question-answering dataset to formally analyze the chain-of-thought output of large language models on a reasoning task. ☆139 Updated 5 months ago
- GenRM-CoT: Data release for verification rationales ☆53 Updated 5 months ago
- ☆27 Updated 2 months ago
- Code for the paper LEGO-Prover: Neural Theorem Proving with Growing Libraries ☆58 Updated last year
- ☆83 Updated 2 months ago
- The official repository of the Omni-MATH benchmark. ☆78 Updated 3 months ago
- ☆64 Updated last year
- The official repo for "TheoremQA: A Theorem-driven Question Answering dataset" (EMNLP 2023) ☆30 Updated 10 months ago
- [AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy ☆56 Updated 3 months ago
- Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering ☆58 Updated 3 months ago
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision ☆119 Updated 6 months ago
- ☆14 Updated 8 months ago
- ☆144 Updated 3 months ago
- ☆65 Updated 11 months ago
- Curation of resources for LLM mathematical reasoning, most of which are screened by @tongyx361 to ensure high quality and accompanied wit… ☆119 Updated 8 months ago
- ☆29 Updated 3 months ago
- Official implementation of AAAI 2025 paper "Augmenting Math Word Problems via Iterative Question Composing" (https://arxiv.org/abs/2401.09… ☆19 Updated 3 months ago
- This is the official implementation of "Lyra: Orchestrating Dual Correction in Automated Theorem Proving" ☆14 Updated 8 months ago
- A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models ☆44 Updated last month
- SatLM: SATisfiability-Aided Language Models using Declarative Prompting (NeurIPS 2023) ☆48 Updated 8 months ago
- ☆23 Updated 6 months ago
- LogicBench is a natural language question-answering dataset consisting of 25 different reasoning patterns spanning over propositional, fi… ☆21 Updated 10 months ago
- Code and data used in the paper: "Training on Incorrect Synthetic Data via RL Scales LLM Math Reasoning Eight-Fold" ☆29 Updated 9 months ago
- Evaluate the Quality of Critique ☆34 Updated 9 months ago
- [NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving* ☆99 Updated 3 months ago