idavidrein / gpqa
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
☆351 Updated 8 months ago
Alternatives and similar repositories for gpqa
Users who are interested in gpqa are comparing it to the repositories listed below
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] ☆247 Updated 3 months ago
- RewardBench: the first evaluation tool for reward models. ☆590 Updated this week
- Benchmarking LLMs with Challenging Tasks from Real Users ☆223 Updated 7 months ago
- A simple unified framework for evaluating LLMs ☆215 Updated last month
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨ ☆219 Updated last year
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' ☆193 Updated 6 months ago
- Reproducible, flexible LLM evaluations ☆205 Updated last month
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆198 Updated last month
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning ☆356 Updated 9 months ago
- Automatic evals for LLMs ☆407 Updated this week
- Code for STaR: Bootstrapping Reasoning With Reasoning (NeurIPS 2022) ☆206 Updated 2 years ago
- A project to improve skills of large language models ☆415 Updated this week
- ☆562 Updated last month
- Code for Quiet-STaR ☆732 Updated 9 months ago
- Benchmarking long-form factuality in large language models. Original code for our paper "Long-form factuality in large language models". ☆613 Updated last month
- Code and data for "Lost in the Middle: How Language Models Use Long Contexts" ☆347 Updated last year
- The official evaluation suite and dynamic data release for MixEval. ☆242 Updated 6 months ago
- (ICML 2024) Alphazero-like Tree-Search can guide large language model decoding and training ☆273 Updated last year
- [ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale ☆248 Updated 3 weeks ago
- ☆174 Updated last month
- ☆330 Updated this week
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them ☆497 Updated 11 months ago
- Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718 ☆329 Updated 8 months ago
- ☆231 Updated 9 months ago
- ☆295 Updated last week
- Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym [ICML 2025] ☆481 Updated last month
- Benchmark and research code for the paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks" ☆208 Updated last month
- ☆316 Updated 8 months ago
- ☆744 Updated last month
- ☆518 Updated 6 months ago