TIGER-AI-Lab / MMLU-Pro
The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]
★206 · Updated 3 weeks ago
Alternatives and similar repositories for MMLU-Pro:
Users that are interested in MMLU-Pro are comparing it to the libraries listed below
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨ ★191 · Updated 11 months ago
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ★349 · Updated 6 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users ★219 · Updated 4 months ago
- Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" ★131 · Updated last month
- Reproducible, flexible LLM evaluations ★180 · Updated this week
- [EMNLP 2024] LongAlign: A Recipe for Long Context Alignment of LLMs ★247 · Updated 3 months ago
- The official evaluation suite and dynamic data release for MixEval. ★233 · Updated 4 months ago
- ★502 · Updated 4 months ago
- [EMNLP 2023] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning ★236 · Updated last year
- Codes for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718 ★313 · Updated 6 months ago
- ★307 · Updated 9 months ago
- [ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models" ★393 · Updated 5 months ago
- ★559 · Updated last week
- This repository provides an original implementation of Detecting Pretraining Data from Large Language Models by *Weijia Shi, *Anirudh Aji… ★218 · Updated last year
- Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models" ★475 · Updated 2 months ago
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks ★141 · Updated 6 months ago
- ★264 · Updated 8 months ago
- FuseAI Project ★550 · Updated 2 months ago
- Code and example data for the paper: Rule Based Rewards for Language Model Safety ★183 · Updated 8 months ago
- OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc. ★283 · Updated this week
- RewardBench: the first evaluation tool for reward models. ★532 · Updated last month
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning ★148 · Updated last week
- Official Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale" ★229 · Updated last month
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". ★195 · Updated 5 months ago
- A highly capable 2.4B lightweight LLM using only 1T pre-training data with all details. ★165 · Updated last week
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling ★95 · Updated 2 months ago
- A lightweight reproduction of DeepSeek-R1-Zero with in-depth analysis of self-reflection behavior. ★212 · Updated this week
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ★131 · Updated 4 months ago
- ★312 · Updated 6 months ago
- ★260 · Updated last week