allenai/olmes
Reproducible, flexible LLM evaluations
☆191 · Updated last month
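For orientation, here is a minimal sketch of driving an olmes evaluation from Python by shelling out to its command-line entry point. The `oe-eval` command name, the example model identifier, the `task::suite` spec format, and the `--output-dir` flag are all assumptions based on the project's documented usage, not a verified API; check the allenai/olmes README before relying on them.

```python
# Minimal sketch: run an olmes evaluation by shelling out to its CLI.
# ASSUMPTIONS: the `oe-eval` entry point, the example model name, and the
# task spec format are unverified and taken from the project's docs;
# confirm against the allenai/olmes README before use.
import subprocess

result = subprocess.run(
    [
        "oe-eval",
        "--model", "olmo-1b",              # hypothetical example model id
        "--task", "arc_challenge::olmes",  # assumed task::suite spec format
        "--output-dir", "my-eval-dir",     # assumed output directory flag
    ],
    check=True,  # raise CalledProcessError if the evaluation fails
)
print("oe-eval exited with return code", result.returncode)
```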
Alternatives and similar repositories for olmes:
Users interested in olmes compare it to the libraries listed below.
- Benchmarking LLMs with Challenging Tasks from Real Users ☆221 · Updated 5 months ago
- The HELMET Benchmark ☆135 · Updated last week
- 🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc. ☆331 · Updated this week
- The official evaluation suite and dynamic data release for MixEval ☆235 · Updated 5 months ago
- Homepage for ProLong (Princeton long-context language models) and the paper "How to Train Long-Context Language Models (Effectively)" ☆175 · Updated last month
- Code for the NeurIPS'24 paper "Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization" ☆187 · Updated 4 months ago
- PyTorch building blocks for the OLMo ecosystem ☆197 · Updated this week
- A simple unified framework for evaluating LLMs ☆209 · Updated last week
- Code and example data for the paper "Rule Based Rewards for Language Model Safety" ☆186 · Updated 9 months ago
- A project to improve the skills of large language models ☆295 · Updated this week
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extreme Lengths (ICLR 2024) ☆205 · Updated 11 months ago
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆139 · Updated 5 months ago
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning ☆191 · Updated last month
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆190 · Updated this week
- RewardBench: the first evaluation tool for reward models ☆555 · Updated last month
- Repository for the paper "Free Process Rewards without Process Labels" ☆143 · Updated last month
- A simple toolkit for benchmarking LLMs on mathematical reasoning tasks 🧮✨ ☆203 · Updated last year
- Implementation of the paper "Data Engineering for Scaling Language Models to 128K Context" ☆459 · Updated last year
- Code for "Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate" ☆139 · Updated this week
- Code and data for "Long-context LLMs Struggle with Long In-context Learning" [TMLR 2025] ☆105 · Updated 2 months ago
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling ☆101 · Updated 3 months ago
- Official repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale" ☆236 · Updated last week
- Open-source code for the paper "Retrieval Head Mechanistically Explains Long-Context Factuality" ☆185 · Updated 8 months ago
- BABILong is a benchmark for LLM evaluation using the needle-in-a-haystack approach. ☆198 · Updated last week