ZubinGou / math-evaluation-harness
A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨
★94 Updated 6 months ago
Related projects
Alternatives and complementary repositories for math-evaluation-harness
- ★51 Updated 7 months ago
- Official repository for paper "Weak-to-Strong Extrapolation Expedites Alignment" ★67 Updated 5 months ago
- [EMNLP 2024 (Oral)] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA ★89 Updated last month
- Curation of resources for LLM mathematical reasoning, most of which are screened by @tongyx361 to ensure high quality and accompanied wit… ★80 Updated 3 months ago
- [NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving* ★74 Updated last month
- ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models ★166 Updated last month
- Code and Data for "Long-context LLMs Struggle with Long In-context Learning" ★91 Updated 4 months ago
- The official repository of the Omni-MATH benchmark. ★45 Updated last week
- Codes and Data for Scaling Relationship on Learning Mathematical Reasoning with Large Language Models ★215 Updated last month
- [NAACL 2024 Outstanding Paper] Source code for the NAACL 2024 paper entitled "R-Tuning: Instructing Large Language Models to Say 'I Don't… ★84 Updated 3 months ago
- Collection of papers for scalable automated alignment. ★71 Updated 2 weeks ago
- [NeurIPS 2024] Knowledge Circuits in Pretrained Transformers ★66 Updated 3 weeks ago
- [ACL 2024] Long-Context Language Modeling with Parallel Encodings ★142 Updated 4 months ago
- [ICML 2024] Selecting High-Quality Data for Training Language Models ★141 Updated 4 months ago
- Implementation of ICML 23 Paper: Specializing Smaller Language Models towards Multi-Step Reasoning. ★125 Updated last year
- Code associated with Tuning Language Models by Proxy (Liu et al., 2024) ★96 Updated 7 months ago
- [ACL'24] Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning ★123 Updated 2 months ago
- This is the repository that contains the source code for the Self-Evaluation Guided MCTS for online DPO. ★187 Updated 3 months ago
- Evaluating Mathematical Reasoning Beyond Accuracy ★37 Updated 7 months ago
- Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)" ★111 Updated last week
- A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench. ★112 Updated last month
- Data and code for our paper "Why Does the Effective Context Length of LLMs Fall Short?" ★59 Updated this week
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ★124 Updated 2 weeks ago
- [NeurIPS 2024] The official implementation of paper: Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs. ★61 Updated 3 weeks ago
- [ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues ★46 Updated 3 months ago
- [NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct ★111 Updated last week
- Awesome LLM Self-Consistency: a curated list of Self-consistency in Large Language Models ★75 Updated 2 months ago
- [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following ★114 Updated 4 months ago
- [EMNLP 2023] MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions ★101 Updated last month
- open-source code for paper: Retrieval Head Mechanistically Explains Long-Context Factuality ★156 Updated 3 months ago