wellecks / lm-evaluation-harness
A framework for few-shot evaluation of autoregressive language models.
☆23 · Updated 10 months ago
Related projects
Alternatives and complementary repositories for lm-evaluation-harness
- Language models scale reliably with over-training and on downstream tasks ☆94 · Updated 7 months ago
- ☆75 · Updated last month
- Simple and efficient pytorch-native transformer training and inference (batched) ☆61 · Updated 7 months ago
- Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024] ☆127 · Updated last month
- ☆50 · Updated 5 months ago
- Homepage for ProLong (Princeton long-context language models) and paper "How to Train Long-Context Language Models (Effectively)" ☆111 · Updated last week
- ☆31 · Updated last year
- A unified benchmark for math reasoning ☆87 · Updated last year
- A dataset of LLM-generated chain-of-thought steps annotated with mistake location. ☆72 · Updated 2 months ago
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision ☆95 · Updated 2 months ago
- [EMNLP 2023, Findings] GRACE: Discriminator-Guided Chain-of-Thought Reasoning ☆44 · Updated 3 weeks ago
- ☆30 · Updated last year
- ☆50 · Updated last week
- Code for the paper "The Impact of Positional Encoding on Length Generalization in Transformers", NeurIPS 2023 ☆124 · Updated 6 months ago
- Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data" ☆44 · Updated 9 months ago
- ☆85 · Updated 11 months ago
- ☆37 · Updated 6 months ago
- GSM-Plus: Data, Code, and Evaluation for Enhancing Robust Mathematical Reasoning in Math Word Problems. ☆45 · Updated 4 months ago
- ☆33 · Updated 2 months ago
- ☆50 · Updated last year
- ☆103 · Updated 4 months ago
- ☆71 · Updated 6 months ago
- Repo for ICML23 "Why do Nearest Neighbor Language Models Work?" ☆56 · Updated last year
- Code for the paper "VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment" ☆77 · Updated 2 weeks ago
- ☆75 · Updated last year
- ☆65 · Updated 7 months ago
- Self-Alignment with Principle-Following Reward Models ☆148 · Updated 8 months ago
- ☆38 · Updated 6 months ago
- Repository for NPHardEval, a quantified-dynamic benchmark of LLMs ☆48 · Updated 7 months ago
- The official code of EMNLP 2022, "SCROLLS: Standardized CompaRison Over Long Language Sequences". ☆68 · Updated 9 months ago