TIGER-AI-Lab / MMLU-Pro
The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]
☆242 · Updated 2 months ago
Alternatives and similar repositories for MMLU-Pro
Users interested in MMLU-Pro are comparing it to the repositories listed below.
- Reproducible, flexible LLM evaluations ☆200 · Updated last week
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆353 · Updated 8 months ago
- Codes for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718 ☆323 · Updated 7 months ago
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆467 · Updated last week
- Benchmarking LLMs with Challenging Tasks from Real Users ☆221 · Updated 6 months ago
- A simple unified framework for evaluating LLMs ☆211 · Updated last month
- RewardBench: the first evaluation tool for reward models. ☆566 · Updated last week
- [EMNLP 2024] LongAlign: A Recipe for Long Context Alignment of LLMs ☆249 · Updated 5 months ago
- Official repository for ORPO ☆452 · Updated 11 months ago
- [ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale ☆245 · Updated 3 weeks ago
- ☆308 · Updated 11 months ago
- The official evaluation suite and dynamic data release for MixEval. ☆239 · Updated 6 months ago
- ☆515 · Updated 5 months ago
- ☆691 · Updated 2 weeks ago
- [EMNLP 2023] Adapting Language Models to Compress Long Contexts ☆303 · Updated 8 months ago
- Automatic evals for LLMs ☆388 · Updated this week
- ☆291 · Updated last month
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap… ☆193 · Updated this week
- [EMNLP 2023] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning ☆240 · Updated last year
- ☆315 · Updated 7 months ago
- Implementation of paper Data Engineering for Scaling Language Models to 128K Context ☆461 · Updated last year
- [ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models" ☆408 · Updated 7 months ago
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark ☆346 · Updated 7 months ago
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨ ☆211 · Updated last year
- FuseAI Project ☆566 · Updated 3 months ago
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning ☆202 · Updated this week
- [ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data … ☆699 · Updated last month
- [NeurIPS'24] SelfCodeAlign: Self-Alignment for Code Generation ☆307 · Updated 2 months ago
- Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" ☆146 · Updated 3 weeks ago
- [ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI ☆356 · Updated last month