GAIR-NLP / OlympicArenaLinks

[NeurIPS 2024] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

☆105

Alternatives and similar repositories for OlympicArena

Users that are interested in OlympicArena are comparing it to the libraries listed below

Sorting:

HKUNLP / critic-rl
[ICML 2025] Teaching Language Models to Critique via Reinforcement Learning
☆114Updated 5 months ago
KbsdJames / Omni-MATH
The official repository of the Omni-MATH benchmark.
☆88Updated 10 months ago
GAIR-NLP / AIME-Preview
☆73Updated 7 months ago
chujiezheng / LLM-Extrapolation
Official repository for ACL 2025 paper "Model Extrapolation Expedites Alignment"
☆75Updated 5 months ago
hkust-nlp / llm-compression-intelligence
Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024]
☆142Updated last year
hkust-nlp / B-STaR
B-STAR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
☆85Updated 5 months ago
GAIR-NLP / weak-to-strong-reasoning
☆58Updated last year
THUDM / T1
RL Scaling and Test-Time Scaling (ICML'25)
☆111Updated 9 months ago
wwxu21 / CUT
Source code of "Reasons to Reject? Aligning Language Models with Judgments"
☆58Updated last year
GAIR-NLP / OctoThinker
Revisiting Mid-training in the Era of Reinforcement Learning Scaling
☆177Updated 3 months ago
TIGER-AI-Lab / MAmmoTH2
Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024]
☆148Updated 11 months ago
ernie-research / Tool-Augmented-Reward-Model
[ICLR'24 spotlight] Tool-Augmented Reward Modeling
☆51Updated 4 months ago
WHGTyen / BIG-Bench-Mistake
A dataset of LLM-generated chain-of-thought steps annotated with mistake location.
☆82Updated last year
mathllm / MathCoder2
☆69Updated last year
google-deepmind / bbeh
☆99Updated 5 months ago
Gen-Verse / CURE
[NeurIPS 2025 Spotlight] ReasonFlux-Coder: Open-Source LLM Coders with Co-Evolving Reinforcement Learning
☆126Updated last month
zitian-gao / SC-MCTS
Interpretable Contrastive Monte Carlo Tree Search Reasoning
☆48Updated 11 months ago
GAIR-NLP / ReAlign
Reformatted Alignment
☆112Updated last year
open-compass / MathBench
[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
☆107Updated 5 months ago
allenai / easy-to-hard-generalization
Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data"
☆48Updated last year
test-time-interaction / TTI
☆63Updated 4 months ago
xufangzhi / phi-Decoding
[ACL 2025] An inference-time decoding strategy with adaptive foresight sampling
☆105Updated 5 months ago
TIGER-AI-Lab / General-Reasoner
General Reasoner: Advancing LLM Reasoning Across All Domains [NeurIPS25]
☆186Updated 4 months ago
TIGER-AI-Lab / AceCoder
The official repo for "AceCoder: Acing Coder RL via Automated Test-Case Synthesis" [ACL25]
☆91Updated 6 months ago
satori-reasoning / Satori
[ICML 2025] Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
☆110Updated 4 months ago
GAIR-NLP / ReasonEval
[AAAI 2025 oral] Evaluating Mathematical Reasoning Beyond Accuracy
☆73Updated 2 weeks ago
RUCAIBox / JiuZhang3.0
The code and data for the paper JiuZhang3.0
☆49Updated last year
yyDing1 / ScaleQuest
[ACL-25] We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method to unleash the reasoning capability of LLMs.
☆68Updated 11 months ago
SynthLabsAI / big-math
A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models
☆65Updated 8 months ago
open-compass / CriticEval
[NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs
☆47Updated 10 months ago