allenai/dolma

Data and tools for generating and inspecting OLMo pre-training data.

☆1,527

Alternatives and similar repositories for dolma

Users that are interested in dolma are comparing it to the libraries listed below.

allenai / OLMo
Modeling, training, eval, and inference code for OLMo
☆6,608 • Nov 24, 2025 • Updated 8 months ago
huggingface / datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆3,223 • Updated this week
allenai / wimbd
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
☆229 • Nov 16, 2024 • Updated last year
allenai / OLMo-Eval-Legacy
Evaluation suite for LLMs
☆378 • Jul 11, 2025 • Updated last year
allenai / open-instruct
AllenAI's post-training codebase
☆3,809 • Updated this week
togethercomputer / RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
☆4,972 • Jun 3, 2026 • Updated last month
huggingface / cosmopedia
☆572 • Nov 20, 2024 • Updated last year
huggingface / nanotron
Minimalistic large language model 3D-parallelism training
☆2,766 • May 26, 2026 • Updated 2 months ago
google-research / deduplicate-text-datasets
☆1,269 • Jul 30, 2024 • Updated last year
mlfoundations / dclm
DataComp for Language Models
☆1,455 • Sep 9, 2025 • Updated 10 months ago
huggingface / alignment-handbook
Robust recipes to align language models with human and AI preferences
☆5,645 • May 26, 2026 • Updated 2 months ago
allenai / catwalk
This project studies the performance and robustness of language models and task-adaptation methods.
☆154 • May 18, 2024 • Updated 2 years ago
EleutherAI / lm-evaluation-harness
A framework for few-shot evaluation of language models.
☆13,407 • Jul 13, 2026 • Updated last week
huggingface / datablations
Scaling Data-Constrained Language Models
☆344 • Jun 28, 2025 • Updated last year
ChenghaoMou / text-dedup
All-in-one text de-duplication
☆764 • Mar 9, 2026 • Updated 4 months ago
allenai / OLMoE
OLMoE: Open Mixture-of-Experts Language Models
☆1,043 • Sep 23, 2025 • Updated 10 months ago
allenai / peS2o
Pretraining Efficiently on S2ORC!
☆187 • Oct 23, 2024 • Updated last year
huggingface / lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
☆2,497 • Jun 29, 2026 • Updated 3 weeks ago
NVIDIA / NeMo-Aligner
Scalable toolkit for efficient model alignment
☆851 • Oct 6, 2025 • Updated 9 months ago
mosaicml / llm-foundry
LLM training code for Databricks foundation models
☆4,431 • Mar 25, 2026 • Updated 4 months ago
argilla-io / distilabel
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi…
☆3,344 • Updated this week
EleutherAI / pythia
The hub for EleutherAI's work on interpretability and learning dynamics
☆2,863 • Nov 15, 2025 • Updated 8 months ago
arcee-ai / mergekit
Tools for merging pretrained large language models.
☆7,261 • Jun 17, 2026 • Updated last month
databricks / megablocks
☆1,583 • Mar 25, 2026 • Updated 4 months ago
XueFuzhao / OpenMoE
A family of open-sourced Mixture-of-Experts (MoE) Large Language Models
☆1,691 • Mar 8, 2024 • Updated 2 years ago
FranxYao / Long-Context-Data-Engineering
Implementation of the paper "Data Engineering for Scaling Language Models to 128K Context"
☆502 • Mar 19, 2024 • Updated 2 years ago
NVIDIA-NeMo / Curator
Scalable data preprocessing and curation toolkit for LLMs
☆1,683 • Updated this week
NVIDIA / Megatron-LM
Ongoing research training transformer models at scale
☆17,212 • Updated this week
tatsu-lab / alpaca_eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
☆2,007 • Aug 9, 2025 • Updated 11 months ago
huggingface / trl
Train transformer language models with reinforcement learning.
☆18,927 • Updated this week
meta-pytorch / torchtune
PyTorch native post-training library
☆5,786 • Updated this week
Dao-AILab / flash-attention
Fast and memory-efficient exact attention
☆24,531 • Updated this week
microsoft / rho
Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs.
☆471 • Apr 18, 2024 • Updated 2 years ago
OpenRLHF / OpenRLHF
An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & VLM & TIS & vLLM & Ray & Asy…
☆9,848 • Jul 14, 2026 • Updated last week
mlfoundations / scaling
Language models scale reliably with over-training and on downstream tasks
☆102 • Apr 2, 2024 • Updated 2 years ago
FasterDecoding / Medusa
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
☆2,758 • Jun 25, 2024 • Updated 2 years ago
multimodal-art-projection / MAP-NEO
☆985 • Feb 7, 2025 • Updated last year
jquesnelle / yarn
YaRN: Efficient Context Window Extension of Large Language Models
☆1,740 • Apr 17, 2024 • Updated 2 years ago
bitsandbytes-foundation / bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
☆8,338 • Updated this week