TIGER-AI-Lab / MMLU-ProLinks

The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]

☆305

Alternatives and similar repositories for MMLU-Pro

Users that are interested in MMLU-Pro are comparing it to the libraries listed below

Sorting:

allenai / WildBench
Benchmarking LLMs with Challenging Tasks from Real Users
☆244Updated last year
WildEval / ZeroEval
A simple unified framework for evaluating LLMs
☆254Updated 6 months ago
tianyi-lab / Reflection_Tuning
[ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning
☆363Updated last year
idavidrein / gpqa
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
☆420Updated last year
JinjieNi / MixEval
The official evaluation suite and dynamic data release for MixEval.
☆250Updated 11 months ago
hao-ai-lab / Dynasor
[NeurIPS 2025] Simple extension on vLLM to help you speed up reasoning model without training.
☆198Updated 5 months ago
allenai / olmes
Reproducible, flexible LLM evaluations
☆260Updated last week
OpenBMB / Eurus
☆320Updated last year
Re-Align / URIAL
☆313Updated last year
da03 / Internalize_CoT_Step_by_Step
☆195Updated 6 months ago
fanqiwan / FuseAI
FuseAI Project
☆583Updated 9 months ago
ZubinGou / math-evaluation-harness
A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨
☆259Updated last year
eddycmu / demystify-long-cot
☆322Updated 5 months ago
OpenBMB / InfiniteBench
Codes for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718
☆353Updated last year
lmarena / p2l
Prompt-to-Leaderboard
☆260Updated 5 months ago
mlfoundations / evalchemy
Automatic evals for LLMs
☆550Updated 4 months ago
voidism / DoLa
Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models"
☆521Updated 9 months ago
knoveleng / open-rs
Official repo for paper: "Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't"
☆267Updated 2 weeks ago
microsoft / LongRoPE
LongRoPE is a novel method that can extends the context window of pre-trained LLMs to an impressive 2048k tokens.
☆263Updated this week
TIGER-AI-Lab / CritiqueFineTuning
Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" [COLM 2025]
☆178Updated 3 months ago
allenai / reward-bench
RewardBench: the first evaluation tool for reward models.
☆646Updated 4 months ago
HKUNLP / ChunkLlama
[ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models"
☆441Updated last year
GAIR-NLP / ProX
[ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
☆263Updated 3 months ago
RyanLiu112 / compute-optimal-tts
Official codebase for "Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling".
☆273Updated 8 months ago
princeton-nlp / AutoCompressors
[EMNLP 2023] Adapting Language Models to Compress Long Contexts
☆317Updated last year
sunblaze-ucb / Intuitor
Code for the paper: "Learning to Reason without External Rewards"
☆369Updated 3 months ago
QwenLM / CodeElo
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
☆55Updated 9 months ago
DataArcTech / LLM-as-a-Judge
☆151Updated 3 weeks ago
sail-sg / oat-zero
A lightweight reproduction of DeepSeek-R1-Zero with indepth analysis of self-reflection behavior.
☆247Updated 6 months ago
andyrdt / refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
☆292Updated 4 months ago