allenai / olmes
Reproducible, flexible LLM evaluations
☆129 · Updated last month
Alternatives and similar repositories for olmes:
Users interested in olmes are comparing it to the libraries listed below.
- Homepage for ProLong (Princeton long-context language models) and the paper "How to Train Long-Context Language Models (Effectively)" ☆150 · Updated last month
- ☆129 · Updated last month
- Benchmarking LLMs with Challenging Tasks from Real Users ☆210 · Updated 2 months ago
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆131 · Updated 3 months ago
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks 🧮✨ ☆154 · Updated 9 months ago
- Repo of the paper "Free Process Rewards without Process Labels" ☆110 · Updated 2 weeks ago
- ☆143 · Updated last week
- A project to improve the skills of large language models ☆239 · Updated this week
- Open-source code for the paper "Retrieval Head Mechanistically Explains Long-Context Factuality" ☆172 · Updated 5 months ago
- Code and data for the paper "Long-context LLMs Struggle with Long In-context Learning" ☆98 · Updated 6 months ago
- Self-Alignment with Principle-Following Reward Models ☆152 · Updated 11 months ago
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extremely Long Length (ICLR 2024) ☆204 · Updated 8 months ago
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆168 · Updated 3 months ago
- The HELMET Benchmark ☆109 · Updated last week
- Data and code for our paper "Why Does the Effective Context Length of LLMs Fall Short?" ☆69 · Updated 2 months ago
- ☆250 · Updated last year
- ☆94 · Updated 7 months ago
- PyTorch building blocks for OLMo ☆49 · Updated this week
- [EMNLP 2024] Source code for the paper "Learning Planning-based Reasoning with Trajectory Collection and Process Rewards Synthesizing" ☆66 · Updated 2 weeks ago
- 🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc. ☆101 · Updated this week
- Official repository for the paper "Weak-to-Strong Extrapolation Expedites Alignment" ☆71 · Updated 7 months ago
- Scalable Meta-Evaluation of LLMs as Evaluators ☆42 · Updated 11 months ago
- Code and example data for the paper "Rule Based Rewards for Language Model Safety" ☆176 · Updated 6 months ago
- A dataset of LLM-generated chain-of-thought steps annotated with mistake location ☆77 · Updated 5 months ago
- The official evaluation suite and dynamic data release for MixEval ☆233 · Updated 2 months ago
- Language models scale reliably with over-training and on downstream tasks ☆96 · Updated 9 months ago
- ☆51 · Updated 2 months ago
- DSIR large-scale data selection framework for language model training ☆242 · Updated 9 months ago
- Critique-out-Loud Reward Models ☆48 · Updated 3 months ago