idavidrein / gpqa
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
☆333Updated 6 months ago
Alternatives and similar repositories for gpqa:
Users who are interested in gpqa are comparing it to the repositories listed below
- Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization'☆187Updated 4 months ago
- ☆166Updated last week
- Reproducible, flexible LLM evaluations☆191Updated last month
- ☆519Updated last week
- ☆326Updated 2 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users☆221Updated 5 months ago
- RewardBench: the first evaluation tool for reward models.☆555Updated last month
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨☆203Updated last year
- ☆283Updated last month
- [ACL'24] Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning☆354Updated 7 months ago
- Automatic evals for LLMs☆373Updated this week
- BABILong is a benchmark for LLM evaluation using the needle-in-a-haystack approach.☆198Updated last week
- A simple unified framework for evaluating LLMs☆209Updated last week
- The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024]☆230Updated last month
- ☆423Updated 9 months ago
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them☆482Updated 10 months ago
- Official repo for paper: "Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't"☆204Updated last month
- The official evaluation suite and dynamic data release for MixEval.☆235Updated 5 months ago
- ☆647Updated 3 weeks ago
- Benchmark and research code for the paper "SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks"☆182Updated last week
- "Improving Mathematical Reasoning with Process Supervision" by OpenAI☆108Updated 2 weeks ago
- Data and Code for Program of Thoughts (TMLR 2023)☆268Updated 11 months ago
- LOFT: A 1 Million+ Token Long-Context Benchmark☆190Updated this week
- ☆922Updated 3 months ago
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"☆440Updated this week
- Official implementation of paper "On the Diagram of Thought" (https://arxiv.org/abs/2409.10038)☆178Updated 3 weeks ago
- Code for Paper: Training Software Engineering Agents and Verifiers with SWE-Gym☆438Updated 3 weeks ago
- Official Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale"☆236Updated last week
- MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems☆86Updated 9 months ago
- Official implementation of paper "Cumulative Reasoning With Large Language Models" (https://arxiv.org/abs/2308.04371)☆292Updated 7 months ago