hendrycks / test

Measuring Massive Multitask Language Understanding | ICLR 2021

☆1,518

Alternatives and similar repositories for test

Users who are interested in test are comparing it to the repositories listed below.

tatsu-lab / alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
☆1,906 Updated 3 months ago
anthropics / hh-rlhf
Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"
☆1,799 Updated 5 months ago
sylinrl / TruthfulQA
TruthfulQA: Measuring How Models Imitate Human Falsehoods
☆840 Updated 10 months ago
openai / grade-school-math
☆1,359 Updated last year
jquesnelle / yarn
YaRN: Efficient Context Window Extension of Large Language Models
☆1,637 Updated last year
bigcode-project / bigcode-evaluation-harness
A framework for the evaluation of autoregressive code generation language models.
☆991 Updated 4 months ago
openai / prm800k
800,000 step-level correctness labels on LLM solutions to MATH problems
☆2,073 Updated 2 years ago
gkamradt / LLMTest_NeedleInAHaystack
Doing simple retrieval from LLM models at various context lengths to measure accuracy
☆2,074 Updated last year
stanford-crfm / helm
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models …
☆2,549 Updated this week
ruixiangcui / AGIEval
☆767 Updated last year
declare-lab / instruct-eval
This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
☆551 Updated last year
FranxYao / chain-of-thought-hub
Benchmarking large language models' complex reasoning ability with chain-of-thought prompting
☆2,756 Updated last year
suzgunmirac / BIG-Bench-Hard
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
☆529 Updated last year
MLGroupJLU / LLM-eval-survey
The official GitHub page for the survey paper "A Survey on Evaluation of Large Language Models".
☆1,581 Updated 5 months ago
google-research / FLAN
☆1,552 Updated 3 weeks ago
XueFuzhao / OpenMoE
A family of open-sourced Mixture-of-Experts (MoE) Large Language Models
☆1,635 Updated last year
hendrycks / math
The MATH Dataset (NeurIPS 2021)
☆1,251 Updated 2 months ago
jianzhnie / awesome-instruction-datasets
A collection of awesome-prompt-datasets, awesome-instruction-dataset, to train ChatLLM such as chatgpt. Collects a wide variety of instruction datasets for training ChatLLM models.
☆712 Updated last year
allenai / natural-instructions
Expanding natural instructions
☆1,025 Updated last year
allenai / dolma
Data and tools for generating and inspecting OLMo pre-training data.
☆1,345 Updated 2 weeks ago
AGI-Edgerunners / LLM-Adapters
Code for our EMNLP 2023 Paper: "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models"
☆1,210 Updated last year
THUDM / LongBench
LongBench v2 and LongBench (ACL '25 & '24)
☆1,020 Updated 10 months ago
princeton-nlp / SimPO
[NeurIPS 2024] SimPO: Simple Preference Optimization with a Reference-Free Reward
☆928 Updated 9 months ago
yuchenlin / LLM-Blender
[ACL2023] We introduce LLM-Blender, an innovative ensembling framework to attain consistently superior performance by leveraging the dive…
☆970 Updated last year
SinclairCoder / Instruction-Tuning-Papers
Reading list of Instruction-tuning. A trend that starts from Natural-Instruction (ACL 2022), FLAN (ICLR 2022) and T0 (ICLR 2022).
☆768 Updated 2 years ago
tatsu-lab / alpaca_farm
A simulation framework for RLHF and alternatives. Develop your RLHF method without collecting human data.
☆835 Updated last year
openai / human-eval
Code for the paper "Evaluating Large Language Models Trained on Code"
☆3,024 Updated 10 months ago
tjunlp-lab / Awesome-LLMs-Evaluation-Papers
The papers are organized according to our survey: Evaluating Large Language Models: A Comprehensive Survey.
☆787 Updated last year
EleutherAI / pythia
The hub for EleutherAI's work on interpretability and learning dynamics
☆2,676 Updated last week
ContextualAI / HALOs
A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs).
☆893 Updated last month