TIGER-AI-Lab / MMLU-Pro
The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]
☆335 · Updated 2 months ago
Alternatives and similar repositories for MMLU-Pro
Users interested in MMLU-Pro are comparing it to the repositories listed below
- Reproducible, flexible LLM evaluations ☆337 · Updated last week
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆366 · Updated last year
- A simple unified framework for evaluating LLMs ☆261 · Updated 9 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users ☆245 · Updated last year
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark ☆466 · Updated last year
- Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs ☆456 · Updated last year
- Automatic evals for LLMs ☆579 · Updated last month
- LongRoPE is a novel method that can extend the context window of pre-trained LLMs to an impressive 2048k tokens ☆277 · Updated 3 months ago
- Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models" ☆535 · Updated last year
- Official repo for the paper "Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't" ☆273 · Updated 3 months ago
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks 🧮✨ ☆273 · Updated last year
- ☆330 · Updated 8 months ago
- [ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI ☆475 · Updated last month
- Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" [COLM 2025] ☆180 · Updated 7 months ago
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆786 · Updated 6 months ago
- [NeurIPS 2025] Simple extension on vLLM to help you speed up reasoning models without training ☆220 · Updated 8 months ago
- The official evaluation suite and dynamic data release for MixEval ☆255 · Updated last year
- Benchmark and research code for the paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks" ☆261 · Updated 9 months ago
- [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents ☆538 · Updated this week
- ☆203 · Updated 9 months ago
- RewardBench: the first evaluation tool for reward models ☆685 · Updated last week
- 🌍 AppWorld: A Controllable World of Apps and People for Benchmarking Function Calling and Interactive Coding Agent, ACL'24 Best Resource… ☆371 · Updated 2 months ago
- [ICML 2025 Oral] CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction ☆567 · Updated 9 months ago
- [ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale ☆265 · Updated 7 months ago
- [EMNLP 2023] Adapting Language Models to Compress Long Contexts ☆327 · Updated last year
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning ☆261 · Updated 8 months ago
- Official repository for ORPO ☆469 · Updated last year
- [ICLR 2026] Learning to Reason without External Rewards ☆389 · Updated last week
- Code and example data for the paper "Rule Based Rewards for Language Model Safety" ☆205 · Updated last year
- ☆320 · Updated last year