open-compass / MathBench
[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
☆97Updated 8 months ago
Alternatives and similar repositories for MathBench:
Users that are interested in MathBench are comparing it to the libraries listed below
- The official repository of the Omni-MATH benchmark.☆77Updated 3 months ago
- [ACL 2024]Official GitHub repo for OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scie…☆135Updated 8 months ago
- We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method to unleash the reasoning capability of LLMs.☆60Updated 4 months ago
- ☆81Updated 11 months ago
- Curation of resources for LLM mathematical reasoning, most of which are screened by @tongyx361 to ensure high quality and accompanied wit…☆118Updated 8 months ago
- Implementation for the research paper "Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision".☆51Updated 3 months ago
- [ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues☆77Updated 8 months ago
- Codes and Data for Scaling Relationship on Learning Mathematical Reasoning with Large Language Models☆250Updated 6 months ago
- ☆143Updated 3 months ago
- Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024]☆130Updated 6 months ago
- The implementation of paper "LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Fee…☆38Updated 8 months ago
- On Memorization of Large Language Models in Logical Reasoning☆56Updated 4 months ago
- Benchmarking Complex Instruction-Following with Multiple Constraints Composition (NeurIPS 2024 Datasets and Benchmarks Track)☆72Updated last month
- Reformatted Alignment☆115Updated 6 months ago
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling☆95Updated 2 months ago
- [NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI☆96Updated 2 weeks ago
- [NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs☆39Updated 3 months ago
- [NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*☆98Updated 3 months ago
- [AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy☆55Updated 3 months ago
- [COLING 2025] ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios☆65Updated 3 months ago
- ☆166Updated last month
- A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench.☆135Updated 2 weeks ago
- Official implementation of the paper "From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large L…☆47Updated 9 months ago
- ☆263Updated 8 months ago
- ☆65Updated 11 months ago
- A dataset of LLM-generated chain-of-thought steps annotated with mistake location.☆79Updated 7 months ago
- Fantastic Data Engineering for Large Language Models☆83Updated 2 months ago
- [EMNLP 2024] LongAlign: A Recipe for Long Context Alignment of LLMs☆246Updated 3 months ago
- Offical Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale"☆229Updated last month
- Code for Paper: Teaching Language Models to Critique via Reinforcement Learning☆84Updated last month