wenhuchen / TheoremQA
The dataset and code for the paper "TheoremQA: A Theorem-driven Question Answering dataset"
☆154 · Updated 7 months ago
Related projects
Alternatives and complementary repositories for TheoremQA
- Self-Alignment with Principle-Following Reward Models ☆147 · Updated 8 months ago
- A dataset of LLM-generated chain-of-thought steps annotated with mistake location. ☆73 · Updated 3 months ago
- Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024] ☆127 · Updated 2 months ago
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆146 · Updated 3 weeks ago
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆124 · Updated 3 weeks ago
- Official Repo for ICLR 2024 paper MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang*, Ziha… ☆104 · Updated 5 months ago
- This is the repo for the paper Shepherd -- A Critic for Language Model Generation ☆213 · Updated last year
- Benchmarking LLMs with Challenging Tasks from Real Users ☆198 · Updated 3 weeks ago
- Curation of resources for LLM mathematical reasoning, most of which are screened by @tongyx361 to ensure high quality and accompanied wit… ☆84 · Updated 4 months ago
- ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models ☆167 · Updated last month
- Codes and Data for Scaling Relationship on Learning Mathematical Reasoning with Large Language Models ☆221 · Updated 2 months ago
- [ICLR 2024] Evaluating Large Language Models at Evaluating Instruction Following ☆118 · Updated 4 months ago
- [EMNLP 2023] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning ☆214 · Updated last year
- A unified benchmark for math reasoning ☆87 · Updated last year
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks. 🧮✨ ☆105 · Updated 6 months ago
- Simple next-token-prediction for RLHF ☆220 · Updated last year
- ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - NeurIPS 2023 (oral) ☆235 · Updated 7 months ago
- Code and data accompanying our paper on arXiv "Faithful Chain-of-Thought Reasoning". ☆155 · Updated 6 months ago
- This is the official repository of the paper "OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI" ☆86 · Updated last month
- Data and code for the ICLR 2023 paper "Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning". ☆145 · Updated 10 months ago
- [NeurIPS 2023 D&B] Code repository for InterCode benchmark https://arxiv.org/abs/2306.14898 ☆194 · Updated 6 months ago
- Code and Data for "Long-context LLMs Struggle with Long In-context Learning" ☆91 · Updated 4 months ago