allenai/OLMo-Eval-Legacy

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/allenai/OLMo-Eval-Legacy)

allenai / OLMo-Eval-Legacy

Evaluation suite for LLMs

☆379

Alternatives and similar repositories for OLMo-Eval-Legacy

Users that are interested in OLMo-Eval-Legacy are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

allenai / OLMo
View on GitHub
Modeling, training, eval, and inference code for OLMo
☆6,505Nov 24, 2025Updated 5 months ago
allenai / catwalk
View on GitHub
This project studies the performance and robustness of language models and task-adaptation methods.
☆154May 18, 2024Updated 2 years ago
allenai / dolma
View on GitHub
Data and tools for generating and inspecting OLMo pre-training data.
☆1,497Nov 5, 2025Updated 6 months ago
allenai / olmes
View on GitHub
Reproducible, flexible LLM evaluations
☆372Mar 24, 2026Updated last month
allenai / open-instruct
View on GitHub
AllenAI's post-training codebase
☆3,726Updated this week
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
nomic-ai / contrastors
View on GitHub
Train Models Contrastively in Pytorch
☆789Mar 26, 2025Updated last year
allenai / beaker-gantry
View on GitHub
Gantry provides an API that streamlines running experiments in Beaker
☆33Apr 8, 2026Updated last month
allenai / wimbd
View on GitHub
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
☆227Nov 16, 2024Updated last year
huggingface / datatrove
View on GitHub
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆3,058May 6, 2026Updated last week
EleutherAI / lm-evaluation-harness
View on GitHub
A framework for few-shot evaluation of language models.
☆12,595May 11, 2026Updated last week
GAIR-NLP / scaleeval
View on GitHub
Scalable Meta-Evaluation of LLMs as Evaluators
☆43Feb 15, 2024Updated 2 years ago
allenai / OLMoE
View on GitHub
OLMoE: Open Mixture-of-Experts Language Models
☆1,024Sep 23, 2025Updated 7 months ago
huggingface / lighteval
View on GitHub
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
☆2,415May 7, 2026Updated last week
JinjieNi / MixEval
View on GitHub
The official evaluation suite and dynamic data release for MixEval.
☆256Nov 10, 2024Updated last year
Wordpress hosting with auto-scaling - Free Trial Offer • Ad
Fully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
huggingface / alignment-handbook
View on GitHub
Robust recipes to align language models with human and AI preferences
☆5,602Apr 8, 2026Updated last month
SkyworkAI / Skywork-MoE
View on GitHub
Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
☆140Jun 12, 2024Updated last year
huggingface / nanotron
View on GitHub
Minimalistic large language model 3D-parallelism training
☆2,690Apr 7, 2026Updated last month
casmlab / NPHardEval
View on GitHub
Repository for NPHardEval, a quantified-dynamic benchmark of LLMs
☆64Mar 26, 2024Updated 2 years ago
baaivision / JudgeLM
View on GitHub
[ICLR 2025 Spotlight] An open-sourced LLM judge for evaluating LLM-generated answers.
☆430Feb 11, 2025Updated last year
arcee-ai / mergekit
View on GitHub
Tools for merging pretrained large language models.
☆7,083May 6, 2026Updated last week
tatsu-lab / alpaca_eval
View on GitHub
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
☆1,986Aug 9, 2025Updated 9 months ago
mlfoundations / dclm
View on GitHub
DataComp for Language Models
☆1,442Sep 9, 2025Updated 8 months ago
allenai / datamap-rs
View on GitHub
Data mapping framework for rust stuff
☆52Mar 25, 2026Updated last month
Proton VPN Special Offer - Get 70% off • Ad
Special partner offer. Trusted by over 100 million users worldwide. Tested, Approved and Recommended by Experts.
EleutherAI / pythia
View on GitHub
The hub for EleutherAI's work on interpretability and learning dynamics
☆2,797Nov 15, 2025Updated 6 months ago
liyucheng09 / Contamination_Detector
View on GitHub
Lightweight tool to identify Data Contamination in LLMs evaluation
☆52Mar 8, 2024Updated 2 years ago
stanford-crfm / helm
View on GitHub
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models …
☆2,789Updated this week
open-compass / opencompass
View on GitHub
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2,GPT-4,LLaMa2, Qwen,GLM, Claude, …
☆7,003Updated this week
allenai / OLMo-core
View on GitHub
PyTorch building blocks for the OLMo ecosystem
☆1,214Updated this week
allenai / WildBench
View on GitHub
Benchmarking LLMs with Challenging Tasks from Real Users
☆254Nov 3, 2024Updated last year
microsoftarchive / promptbench
View on GitHub
A unified evaluation framework for large language models
☆2,801Feb 20, 2026Updated 2 months ago
argilla-io / distilabel
View on GitHub
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi…
☆3,217Apr 27, 2026Updated 3 weeks ago
allenai / dolma3
View on GitHub
☆77Apr 20, 2026Updated 3 weeks ago
Deploy on Railway without the complexity - Free Credits Offer • Ad
Connect your repo and Railway handles the rest with instant previews. Quickly provision container image services, databases, and storage volumes.
huggingface / llm-swarm
View on GitHub
Manage scalable open LLM inference endpoints in Slurm clusters
☆287Jul 11, 2024Updated last year
IBM / benchbench
View on GitHub
A package dedicated for running benchmark agreement testing
☆18Sep 18, 2025Updated 8 months ago
NVIDIA / Megatron-LM
View on GitHub
Ongoing research training transformer models at scale
☆16,340Updated this week
tjunlp-lab / Awesome-LLMs-Evaluation-Papers
View on GitHub
The papers are organized according to our survey: Evaluating Large Language Models: A Comprehensive Survey.
☆803May 8, 2024Updated 2 years ago
lucidrains / self-rewarding-lm-pytorch
View on GitHub
Implementation of the training framework proposed in Self-Rewarding Language Model, from MetaAI
☆1,409Apr 11, 2024Updated 2 years ago
allenai / olmo-cookbook
View on GitHub
OLMost every training recipe you need to perform data interventions with the OLMo family of models.
☆73Apr 28, 2026Updated 2 weeks ago
openai / simple-evals
View on GitHub
☆4,487Apr 22, 2026Updated 3 weeks ago