idavidrein / gpqaLinks

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

☆417

Alternatives and similar repositories for gpqa

Users that are interested in gpqa are comparing it to the libraries listed below

Sorting:

WildEval / ZeroEval
A simple unified framework for evaluating LLMs
☆251Updated 6 months ago
TIGER-AI-Lab / MMLU-Pro
The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]
☆297Updated 7 months ago
allenai / olmes
Reproducible, flexible LLM evaluations
☆257Updated last week
project-numina / aimo-progress-prize
☆465Updated last year
allenai / reward-bench
RewardBench: the first evaluation tool for reward models.
☆643Updated 4 months ago
huggingface / cosmopedia
☆544Updated 11 months ago
LiveCodeBench / LiveCodeBench
Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"
☆688Updated 3 months ago
mlfoundations / evalchemy
Automatic evals for LLMs
☆547Updated 3 months ago
NVIDIA-NeMo / Skills
A project to improve skills of large language models
☆587Updated last week
allenai / WildBench
Benchmarking LLMs with Challenging Tasks from Real Users
☆242Updated 11 months ago
JinjieNi / MixEval
The official evaluation suite and dynamic data release for MixEval.
☆250Updated 11 months ago
da03 / Internalize_CoT_Step_by_Step
☆195Updated 6 months ago
suzgunmirac / BIG-Bench-Hard
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
☆519Updated last year
OSU-NLP-Group / GrokkedTransformer
Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization'
☆233Updated 3 months ago
ezelikman / STaR
Code for STaR: Bootstrapping Reasoning With Reasoning (NeurIPS 2022)
☆214Updated 2 years ago
tianyi-lab / Reflection_Tuning
[ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning
☆362Updated last year
SWE-Gym / SWE-Gym
Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025]
☆553Updated 2 months ago
bigcode-project / bigcodebench
[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI
☆440Updated last week
lmarena / arena-hard-auto
Arena-Hard-Auto: An automatic LLM benchmark.
☆942Updated 4 months ago
waterhorse1 / LLM_Tree_Search
(ICML 2024) Alphazero-like Tree-Search can guide large language model decoding and training
☆283Updated last year
ezelikman / quiet-star
Code for Quiet-STaR
☆739Updated last year
lm-sys / llm-decontaminator
Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"
☆311Updated last year
google-deepmind / long-form-factuality
Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models".
☆643Updated 2 months ago
eth-sri / matharena
Evaluation of LLMs on latest math competitions
☆172Updated this week
lukasberglund / reversal_curse
☆296Updated last year
xfactlab / orpo
Official repository for ORPO
☆463Updated last year
google-deepmind / loft
LOFT: A 1 Million+ Token Long-Context Benchmark
☆218Updated 4 months ago
MARIO-Math-Reasoning / Super_MARIO
☆342Updated 4 months ago
booydar / babilong
BABILong is a benchmark for LLM evaluation using the needle-in-a-haystack approach.
☆215Updated last month
LeonGuertler / TextArena
A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning
☆291Updated 2 weeks ago