qtli / GSM-PlusLinks

GSM-Plus: Data, Code, and Evaluation for Enhancing Robust Mathematical Reasoning in Math Word Problems.

☆63

Alternatives and similar repositories for GSM-Plus

Users that are interested in GSM-Plus are comparing it to the libraries listed below

Sorting:

GAIR-NLP / ReasonEval
[AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy
☆73Updated 2 weeks ago
FreedomIntelligence / OVM
☆69Updated last year
SparkJiao / dpo-trajectory-reasoning
[EMNLP 2024] Source code for the paper "Learning Planning-based Reasoning with Trajectory Collection and Process Rewards Synthesizing".
☆82Updated 9 months ago
GAIR-NLP / alignment-for-honesty
☆75Updated last year
ChengpengLi1003 / DotaMath
☆30Updated 9 months ago
GAIR-NLP / BeHonest
BeHonest: Benchmarking Honesty in Large Language Models
☆34Updated last year
HillZhang1999 / ICD
Code & Data for our Paper "Alleviating Hallucinations of Large Language Models through Induced Hallucinations"
☆69Updated last year
chujiezheng / LLM-Extrapolation
Official repository for ACL 2025 paper "Model Extrapolation Expedites Alignment"
☆75Updated 5 months ago
Re-Align / AlignTDS
Analyzing LLM Alignment via Token distribution shift
☆17Updated last year
HKUNLP / STRING
[ICLR'25] Data and code for our paper "Why Does the Effective Context Length of LLMs Fall Short?"
☆78Updated 10 months ago
Spico197 / MoE-SFT
🍼 Official implementation of Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts
☆41Updated last year
GAIR-NLP / weak-to-strong-reasoning
☆58Updated last year
ernie-research / Tool-Augmented-Reward-Model
[ICLR'24 spotlight] Tool-Augmented Reward Modeling
☆51Updated 4 months ago
swj0419 / in-context-pretraining
☆54Updated last year
RZFan525 / Awesome-ScalingLaws
A curated list of awesome resources dedicated to Scaling Laws for LLMs
☆79Updated 2 years ago
OpenMOSS / Say-I-Dont-Know
[ICML'2024] Can AI Assistants Know What They Don't Know?
☆83Updated last year
KbsdJames / Omni-MATH
The official repository of the Omni-MATH benchmark.
☆88Updated 10 months ago
princeton-nlp / LLMBar
[ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following
☆131Updated last year
QwenLM / PolyMath
[NeurIPS 2025 D&B Track] Evaluation Code Repo for Paper "PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts"
☆30Updated 5 months ago
hanxuhu / SeqIns
The repository of the project "Fine-tuning Large Language Models with Sequential Instructions", code base comes from open-instruct and LA…
☆29Updated 10 months ago
QwenLM / ProcessBench
Official repository for ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning"
☆174Updated 5 months ago
FranxYao / FlanT5-CoT-Specialization
Implementation of ICML 23 Paper: Specializing Smaller Language Models towards Multi-Step Reasoning.
☆132Updated 2 years ago
YuxiXie / SelfEval-Guided-Decoding
☆103Updated last year
CriticBench / CriticBench
[ACL 2024 Findings] CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
☆27Updated last year
KbsdJames / omni-math-rule
The rule-based evaluation subset and code implementation of Omni-MATH
☆23Updated 10 months ago
LuoXiaoHeics / Continual-Tune
☆10Updated 8 months ago
GAIR-NLP / MetaCritique
Evaluate the Quality of Critique
☆36Updated last year
nayeon7lee / FactualityPrompt
☆86Updated 2 years ago
Zhou-Zoey / RMB-Reward-Model-Benchmark
☆43Updated 6 months ago
epfl-dlab / llm-latent-language
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
☆80Updated last year