hiyouga / MathRuler
A light-weight tool for evaluating LLMs in rule-based ways.
☆46 Updated 2 months ago
Alternatives and similar repositories for MathRuler:
Users who are interested in MathRuler are comparing it to the repositories listed below.
- ☆37 Updated 2 weeks ago
- The official repository of the Omni-MATH benchmark. ☆80 Updated 4 months ago
- [NeurIPS'24] Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models ☆58 Updated 4 months ago
- [ICLR 2025] Benchmarking Agentic Workflow Generation ☆79 Updated 2 months ago
- Official codebase for "GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning". ☆64 Updated last week
- ☆125 Updated 3 weeks ago
- xVerify: Efficient Answer Verifier for Reasoning Model Evaluations ☆75 Updated last week
- ☆55 Updated 6 months ago
- [NeurIPS 2024] The official implementation of paper: Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs. ☆115 Updated last month
- ☆57 Updated last month
- Code for Paper: Teaching Language Models to Critique via Reinforcement Learning ☆94 Updated last week
- ☆60 Updated this week
- Implementation for the research paper "Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision". ☆52 Updated 4 months ago
- Interpretable Contrastive Monte Carlo Tree Search Reasoning ☆48 Updated 5 months ago
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling ☆101 Updated 3 months ago
- ☆101 Updated 4 months ago
- a-m-team's exploration in large language modeling ☆49 Updated 3 weeks ago
- ☆41 Updated this week
- ☆41 Updated 2 weeks ago
- ☆149 Updated 4 months ago
- [NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI ☆101 Updated last month
- [NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving* ☆101 Updated 4 months ago
- [NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs ☆108 Updated this week
- [NeurIPS 2024] MATH-Vision dataset and code to measure multimodal mathematical reasoning capabilities. ☆103 Updated 2 weeks ago
- Official code of *Virgo: A Preliminary Exploration on Reproducing o1-like MLLM* ☆100 Updated 2 months ago
- [AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy ☆60 Updated 4 months ago
- Unofficial Implementation of Chain-of-Thought Reasoning Without Prompting ☆32 Updated last year
- Reformatted Alignment ☆115 Updated 7 months ago
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning ☆173 Updated last month
- Official repository for paper "Weak-to-Strong Extrapolation Expedites Alignment" ☆74 Updated 10 months ago