sarahmart / HARDMath
A new dataset of difficult graduate-level applied mathematics problems; evaluations demonstrate that leading LLMs currently exhibit low accuracy in solving these problems.
☆17 · Updated 3 months ago
Alternatives and similar repositories for HARDMath
Users who are interested in HARDMath are comparing it to the repositories listed below.
- LINC: Logical Inference via Neurosymbolic Computation [EMNLP2023] ☆68 · Updated last year
- ☆25 · Updated 8 months ago
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision ☆120 · Updated 8 months ago
- The official implementation of "Self-play LLM Theorem Provers with Iterative Conjecturing and Proving" ☆79 · Updated last month
- GenRM-CoT: Data release for verification rationales ☆59 · Updated 6 months ago
- Revisiting Mid-training in the Era of RL Scaling ☆37 · Updated 3 weeks ago
- Collections of RLxLM experiments using minimal codes ☆12 · Updated 2 months ago
- [AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy ☆61 · Updated 5 months ago
- ☆83 · Updated 3 months ago
- ☆67 · Updated last year
- [NeurIPS'24] Official code for *DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving* ☆105 · Updated 5 months ago
- Code release for "Debating with More Persuasive LLMs Leads to More Truthful Answers" ☆105 · Updated last year
- Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering ☆57 · Updated 5 months ago
- Test-time-training on nearest neighbors for large language models ☆41 · Updated last year
- Code & data for ICLR 2024 spotlight paper: MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data ☆41 · Updated 11 months ago
- ☆151 · Updated 4 months ago
- [ICML 2025] Teaching Language Models to Critique via Reinforcement Learning ☆95 · Updated last week
- Code and data used in the paper: "Training on Incorrect Synthetic Data via RL Scales LLM Math Reasoning Eight-Fold" ☆30 · Updated 11 months ago
- ☆14 · Updated 6 months ago
- The official code release for Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization ☆30 · Updated 2 months ago
- GSM-Plus: Data, Code, and Evaluation for Enhancing Robust Mathematical Reasoning in Math Word Problems. ☆61 · Updated 10 months ago
- Implementation of the Quiet-STAR paper (https://arxiv.org/pdf/2403.09629.pdf) ☆53 · Updated 9 months ago
- Can Language Models Solve Olympiad Programming? ☆116 · Updated 4 months ago
- ☆29 · Updated 4 months ago
- [NeurIPS 2024] Can LLMs Learn by Teaching for Better Reasoning? A Preliminary Study ☆49 · Updated 5 months ago
- The official repository of the Omni-MATH benchmark. ☆82 · Updated 4 months ago
- Code for the paper LEGO-Prover: Neural Theorem Proving with Growing Libraries ☆62 · Updated last year
- The official repo for "TheoremQA: A Theorem-driven Question Answering dataset" (EMNLP 2023) ☆31 · Updated last year
- [NeurIPS'24 Spotlight] Observational Scaling Laws ☆54 · Updated 7 months ago
- ☆47 · Updated last week