TIGER-AI-Lab / MMLU-Pro
The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]
☆227 · Updated last month
Alternatives and similar repositories for MMLU-Pro:
Users interested in MMLU-Pro are comparing it to the repositories listed below.
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. ☆203 · Updated 11 months ago
- Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718 ☆319 · Updated 6 months ago
- Official repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale" ☆236 · Updated this week
- Benchmarking LLMs with Challenging Tasks from Real Users ☆220 · Updated 5 months ago
- Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs. ☆407 · Updated last year
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning ☆190 · Updated last month
- RewardBench: the first evaluation tool for reward models. ☆555 · Updated last month
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆354 · Updated 7 months ago
- A lightweight reproduction of DeepSeek-R1-Zero with in-depth analysis of self-reflection behavior. ☆223 · Updated last week
- ☆267 · Updated 8 months ago
- ☆282 · Updated last month
- A highly capable 2.4B lightweight LLM using only 1T tokens of pre-training data, with all details. ☆174 · Updated last week
- Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" ☆137 · Updated 2 months ago
- ☆166 · Updated this week
- ☆630 · Updated 3 weeks ago
- The official evaluation suite and dynamic data release for MixEval. ☆235 · Updated 5 months ago
- [EMNLP 2024] LongAlign: A Recipe for Long Context Alignment of LLMs ☆249 · Updated 4 months ago
- Reproducible, flexible LLM evaluations ☆191 · Updated 3 weeks ago
- A simple unified framework for evaluating LLMs ☆209 · Updated last week
- ☆308 · Updated 10 months ago
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data … ☆676 · Updated last month
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark ☆331 · Updated 6 months ago
- Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models" ☆481 · Updated 3 months ago
- A series of technical reports on Slow Thinking with LLMs ☆644 · Updated last week
- Implementation of the paper "Data Engineering for Scaling Language Models to 128K Context" ☆459 · Updated last year
- [EMNLP 2023] Adapting Language Models to Compress Long Contexts ☆302 · Updated 7 months ago
- Code and example data for the paper "Rule Based Rewards for Language Model Safety" ☆186 · Updated 9 months ago
- ☆326 · Updated 2 months ago
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks ☆142 · Updated 7 months ago
- [ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models" ☆403 · Updated 6 months ago