huggingface/datatrove

huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

☆3,219

Alternatives and similar repositories for datatrove

Users who are interested in datatrove are comparing it to the libraries listed below.

huggingface / nanotron
Minimalistic large language model 3D-parallelism training
☆2,761 · May 26, 2026 · Updated last month
huggingface / lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
☆2,493 · Jun 29, 2026 · Updated 3 weeks ago
huggingface / cosmopedia
☆572 · Nov 20, 2024 · Updated last year
allenai / dolma
Data and tools for generating and inspecting OLMo pre-training data.
☆1,526 · Nov 5, 2025 · Updated 8 months ago
huggingface / alignment-handbook
Robust recipes to align language models with human and AI preferences
☆5,641 · May 26, 2026 · Updated last month
argilla-io / distilabel
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi…
☆3,341 · Updated this week
arcee-ai / mergekit
Tools for merging pretrained large language models.
☆7,254 · Jun 17, 2026 · Updated last month
ChenghaoMou / text-dedup
All-in-one text de-duplication
☆764 · Mar 9, 2026 · Updated 4 months ago
huggingface / text-clustering
Easily embed, cluster and semantically label text datasets
☆610 · Mar 28, 2024 · Updated 2 years ago
axolotl-ai-cloud / axolotl
Go ahead and axolotl questions
☆12,232 · Updated this week
EleutherAI / lm-evaluation-harness
A framework for few-shot evaluation of language models.
☆13,359 · Jul 13, 2026 · Updated last week
huggingface / trl
Train transformer language models with reinforcement learning.
☆18,906 · Updated this week
databricks / lilac
Curate better data for LLMs
☆1,072 · Mar 19, 2024 · Updated 2 years ago
meta-pytorch / torchtune
PyTorch native post-training library
☆5,786 · Updated this week
google-research / deduplicate-text-datasets
☆1,270 · Jul 30, 2024 · Updated last year
huggingface / text-generation-inference
Large Language Model Text Generation Inference
☆10,880 · Mar 21, 2026 · Updated 4 months ago
mlfoundations / dclm
DataComp for Language Models
☆1,455 · Sep 9, 2025 · Updated 10 months ago
pytorch / torchtitan
A PyTorch native platform for training generative AI models
☆5,552 · Updated this week
databricks / megablocks
☆1,582 · Mar 25, 2026 · Updated 3 months ago
huggingface / llm-swarm
Manage scalable open LLM inference endpoints in Slurm clusters
☆289 · Jul 11, 2024 · Updated 2 years ago
togethercomputer / RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
☆4,970 · Jun 3, 2026 · Updated last month
NVIDIA-NeMo / Curator
Scalable data pre processing and curation toolkit for LLMs
☆1,674 · Updated this week
bitsandbytes-foundation / bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
☆8,337 · Updated this week
mosaicml / llm-foundry
LLM training code for Databricks foundation models
☆4,431 · Mar 25, 2026 · Updated 3 months ago
linkedin / Liger-Kernel
Efficient Triton Kernels for LLM Training
☆6,531 · Updated this week
allenai / open-instruct
AllenAI's post-training codebase
☆3,805 · Updated this week
allenai / wimbd
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
☆229 · Nov 16, 2024 · Updated last year
huggingface / peft
🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
☆21,433 · Updated this week
facebookresearch / lingua
Meta Lingua: a lean, efficient, and easy-to-hack codebase to research LLMs.
☆4,755 · Jul 18, 2025 · Updated last year
allenai / OLMo
Modeling, training, eval, and inference code for OLMo
☆6,600 · Nov 24, 2025 · Updated 7 months ago
meta-pytorch / gpt-fast
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
☆6,229 · Aug 22, 2025 · Updated 11 months ago
dottxt-ai / outlines
Structured Outputs
☆15,101 · Updated this week
AnswerDotAI / RAGatouille
Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-…
☆3,941 · May 17, 2025 · Updated last year
argilla-io / argilla
Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
☆5,044 · Updated this week
Dao-AILab / flash-attention
Fast and memory-efficient exact attention
☆24,502 · Updated this week
NVIDIA / NeMo-Aligner
Scalable toolkit for efficient model alignment
☆850 · Oct 6, 2025 · Updated 9 months ago
CarperAI / trlx
A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)
☆4,753 · Jan 8, 2024 · Updated 2 years ago
huggingface / datablations
Scaling Data-Constrained Language Models
☆344 · Jun 28, 2025 · Updated last year
stanfordnlp / dspy
DSPy: The framework for programming—not prompting—language models
☆36,306 · Updated this week