hkust-nlp / RL-Verifier-Robustness

From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning.

☆23

Alternatives and similar repositories for RL-Verifier-Robustness

Users who are interested in RL-Verifier-Robustness are comparing it to the repositories listed below.

sail-sg / ActivePRM
☆19 Updated 7 months ago
jinzhuoran / RAG-RewardBench
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment
☆16 Updated 11 months ago
kiaia / GIRAFFE
Extending context length of visual language models
☆12 Updated 11 months ago
hkust-nlp / mstar
[ICML 2025] M-STAR (Multimodal Self-Evolving TrAining for Reasoning) Project. Diving into Self-Evolving Training for Multimodal Reasoning
☆69 Updated 4 months ago
KbsdJames / omni-math-rule
The rule-based evaluation subset and code implementation of Omni-MATH
☆24 Updated 10 months ago
GAIR-NLP / BeHonest
BeHonest: Benchmarking Honesty in Large Language Models
☆34 Updated last year
GAIR-NLP / weak-to-strong-reasoning
☆58 Updated last year
rookie-joe / AutoPSV
☆50 Updated last year
hkust-nlp / Laser
Laser: Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
☆59 Updated 5 months ago
Yifan-Song793 / GoodBadGreedy
The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
☆30 Updated last year
sunnweiwei / FoldAgent
☆57 Updated 3 weeks ago
SihengLi99 / LLM-Honesty-Survey
[2025-TMLR] A Survey on the Honesty of Large Language Models
☆62 Updated 11 months ago
hanxuhu / SeqIns
The repository of the project "Fine-tuning Large Language Models with Sequential Instructions", code base comes from open-instruct and LA…
☆30 Updated 11 months ago
kkk-an / UltraIF
Code of EMNLP 2025 paper 'UltraIF: Advancing Instruction Following from the Wild'.
☆19 Updated 7 months ago
sail-sg / dice
Official implementation of Bootstrapping Language Models via DPO Implicit Rewards
☆44 Updated 7 months ago
GAIR-NLP / self-improvement-reversal
☆13 Updated last year
thu-coai / BARREL
☆16 Updated 5 months ago
TingchenFu / MathIF
Instruction-following benchmark for large reasoning models
☆45 Updated 3 months ago
hkust-nlp / GUIMid
☆21 Updated 6 months ago
GAIR-NLP / ReasonEval
[AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy
☆76 Updated last month
test-time-interaction / TTI
☆64 Updated 5 months ago
ZhentingWang / DUMP
☆32 Updated 6 months ago
THU-KEG / RM-Bench
[ICLR 25 Oral] RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
☆67 Updated 4 months ago
RLHFlow / RAFT
This is an official implementation of the Reward rAnked Fine-Tuning Algorithm (RAFT), also known as iterative best-of-n fine-tuning or re…
☆37 Updated last year
zzzhr97 / SpecBench
☆22 Updated 3 weeks ago
hkust-nlp / Activation_Decoding
In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation (ICML 2024)
☆62 Updated last year
lukahhcm / Awesome_Environment_Scaling
Resources and paper list for 'Scaling Environments for Agents'. This repository accompanies our survey on how environments contribute to …
☆25 Updated last week
bobxwu / learning-from-rewards-llm-papers
A comprehensive collection of learning from rewards in the post-training and test-time scaling of LLMs, with a focus on both reward model…
☆58 Updated 5 months ago
UCSB-NLP-Chang / ThinkPrune
☆45 Updated last month
Kwai-Klear / RLEP
RL with Experience Replay
☆48 Updated 3 months ago