idavidrein / gpqa
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
☆139 · Updated 5 months ago
Related projects:
- Benchmarking LLMs with Challenging Tasks from Real Users ☆182 · Updated last month
- Code for the paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' ☆140 · Updated 3 months ago
- ModuleFormer is a MoE-based architecture that includes two different types of experts: stick-breaking attention heads and feedforward exp… ☆218 · Updated 5 months ago
- ☆224 · Updated 3 months ago
- BABILong is a benchmark for LLM evaluation using the needle-in-a-haystack approach. ☆139 · Updated last month
- Extract full next-token probabilities via language model APIs ☆226 · Updated 6 months ago
- Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples" ☆264 · Updated 9 months ago
- ☆105 · Updated this week
- Functional Benchmarks and the Reasoning Gap ☆74 · Updated last month
- ☆259 · Updated 10 months ago
- [NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898 ☆182 · Updated 4 months ago
- [ICLR 2024 Spotlight] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets ☆209 · Updated 8 months ago
- The official evaluation suite and dynamic data release for MixEval. ☆200 · Updated last week
- A curated collection of LLM reasoning and planning resources, including key papers, limitations, benchmarks, and additional learning mate… ☆148 · Updated 3 weeks ago
- Code and data accompanying our paper on arXiv "Faithful Chain-of-Thought Reasoning". ☆151 · Updated 4 months ago
- open-source code for paper: Retrieval Head Mechanistically Explains Long-Context Factuality ☆135 · Updated last month
- A package to generate summaries of long-form text and evaluate the coherence of these summaries. Official package for our ICLR 2024 paper… ☆97 · Updated 2 weeks ago
- RewardBench: the first evaluation tool for reward models. ☆352 · Updated last week
- Code and Data for Tau-Bench ☆91 · Updated this week
- The GitHub repo for Goal Driven Discovery of Distributional Differences via Language Descriptions ☆68 · Updated last year
- Official code for "MAmmoTH2: Scaling Instructions from the Web" ☆106 · Updated last week
- Evaluating LLMs with fewer examples ☆131 · Updated 5 months ago
- ☆74 · Updated this week
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆127 · Updated 3 weeks ago
- ☆77 · Updated last month
- Code for In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering ☆130 · Updated 2 months ago
- Attribute (or cite) statements generated by LLMs back to in-context information. ☆107 · Updated 2 weeks ago
- [EMNLP 2023] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning ☆201 · Updated 10 months ago
- datasets from the paper "Towards Understanding Sycophancy in Language Models" ☆59 · Updated 10 months ago
- Steering vectors for transformer language models in Pytorch / Huggingface ☆52 · Updated 2 months ago