idavidrein / gpqa
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
☆201 · Updated 2 months ago
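For context, GPQA is consumed as a multiple-choice QA dataset. Below is a minimal sketch of loading it with the Hugging Face `datasets` library; it assumes the dataset is mirrored on the Hub as `Idavidrein/gpqa` with a `gpqa_main` config and `Question` / `Correct Answer` fields (the Hub copy is gated, so this also assumes you have accepted the terms and authenticated).

```python
# Minimal sketch: loading GPQA for inspection or evaluation.
# Assumption: the dataset is mirrored on the Hugging Face Hub as "Idavidrein/gpqa"
# with a "gpqa_main" config and only a "train" split; the Hub copy is gated,
# so `huggingface-cli login` (after accepting the terms) is required.
from datasets import load_dataset

gpqa = load_dataset("Idavidrein/gpqa", "gpqa_main", split="train")

# Field names below ("Question", "Correct Answer") are assumptions based on the
# released CSV schema; adjust if the local copy uses different column names.
for example in gpqa.select(range(3)):
    print(example["Question"][:120], "->", example["Correct Answer"][:60])
```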
Alternatives and similar repositories for gpqa:
Users interested in gpqa are comparing it to the libraries listed below:
- Code for the NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' ☆170 · Updated 2 weeks ago
- Evaluating LLMs with fewer examples ☆139 · Updated 8 months ago
- Benchmarking LLMs with Challenging Tasks from Real Users ☆202 · Updated last month
- Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction" ☆135 · Updated 2 months ago
- The official evaluation suite and dynamic data release for MixEval ☆230 · Updated last month
- Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file ☆141 · Updated last month
- Repository for the paper "Stream of Search: Learning to Search in Language" ☆105 · Updated 4 months ago
- A simple unified framework for evaluating LLMs ☆153 · Updated last month
- BABILong is a benchmark for LLM evaluation using the needle-in-a-haystack approach ☆167 · Updated 3 weeks ago
- Datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆64 · Updated last year
- Code and data accompanying the arXiv paper "Faithful Chain-of-Thought Reasoning" ☆157 · Updated 7 months ago
- RewardBench: the first evaluation tool for reward models ☆462 · Updated this week
- [ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets ☆212 · Updated 11 months ago
- Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic's published research) ☆170 · Updated this week
- A benchmark that challenges language models to code solutions for scientific problems ☆92 · Updated last week
- Implementation of the Quiet-STaR paper (https://arxiv.org/pdf/2403.09629.pdf) ☆44 · Updated 4 months ago
- Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples" ☆294 · Updated 11 months ago
- Steering vectors for transformer language models in PyTorch / Hugging Face ☆69 · Updated 3 weeks ago
- Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code" ☆239 · Updated 2 months ago
- A toolkit for describing model features and intervening on those features to steer behavior ☆132 · Updated last month
- Can Language Models Solve Olympiad Programming? ☆104 · Updated 4 months ago
- Sparse autoencoders ☆379 · Updated last week
- 🌍 Repository for "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agent", ACL'24 Best Resource Pap… ☆126 · Updated 2 weeks ago
- Extract full next-token probabilities via language model APIs ☆230 · Updated 9 months ago