idavidrein / gpqa
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
☆417 · Updated last year
Alternatives and similar repositories for gpqa
Users interested in gpqa are comparing it to the repositories listed below.
- A simple unified framework for evaluating LLMs ☆251 · Updated 6 months ago
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] ☆297 · Updated 7 months ago
- Reproducible, flexible LLM evaluations ☆257 · Updated last week
- ☆465 · Updated last year
- RewardBench: the first evaluation tool for reward models ☆643 · Updated 4 months ago
- ☆544 · Updated 11 months ago
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆688 · Updated 3 months ago
- Automatic evals for LLMs ☆547 · Updated 3 months ago
- A project to improve the skills of large language models ☆587 · Updated last week
- Benchmarking LLMs with Challenging Tasks from Real Users ☆242 · Updated 11 months ago
- The official evaluation suite and dynamic data release for MixEval ☆250 · Updated 11 months ago
- ☆195 · Updated 6 months ago
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them ☆519 · Updated last year
- Code for the NeurIPS'24 paper "Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization" ☆233 · Updated 3 months ago
- Code for STaR: Bootstrapping Reasoning With Reasoning (NeurIPS 2022) ☆214 · Updated 2 years ago
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆362 · Updated last year
- Code for the paper "Training Software Engineering Agents and Verifiers with SWE-Gym" [ICML 2025] ☆553 · Updated 2 months ago
- [ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI ☆440 · Updated last week
- Arena-Hard-Auto: An automatic LLM benchmark ☆942 · Updated 4 months ago
- (ICML 2024) AlphaZero-like tree search can guide large language model decoding and training ☆283 · Updated last year
- Code for Quiet-STaR ☆739 · Updated last year
- Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples" ☆311 · Updated last year
- Benchmarking long-form factuality in large language models. Original code for the paper "Long-form factuality in large language models" ☆643 · Updated 2 months ago
- Evaluation of LLMs on the latest math competitions ☆172 · Updated this week
- ☆296 · Updated last year
- Official repository for ORPO ☆463 · Updated last year
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆218 · Updated 4 months ago
- ☆342 · Updated 4 months ago
- BABILong: a benchmark for LLM evaluation using the needle-in-a-haystack approach ☆215 · Updated last month
- A Collection of Competitive Text-Based Games for Language Model Evaluation and Reinforcement Learning ☆291 · Updated 2 weeks ago