bigscience-workshop/data-preparation

Code used for sourcing and cleaning the BigScience ROOTS corpus

☆318

Alternatives and similar repositories for data-preparation

Users who are interested in data-preparation are comparing it to the repositories listed below.

bigscience-workshop / data_tooling
Tools for managing datasets for governance and training.
☆91 · May 25, 2026 · Updated last month
ChenghaoMou / text-dedup
All-in-one text de-duplication
☆764 · Mar 9, 2026 · Updated 4 months ago
oscar-project / ungoliant
The pipeline for the OSCAR corpus
☆178 · Nov 9, 2025 · Updated 8 months ago
bigcode-project / bigcode-dataset
☆496 · Aug 15, 2024 · Updated last year
google-research / deduplicate-text-datasets
☆1,270 · Jul 30, 2024 · Updated last year
facebookresearch / cc_net
Tools to download and cleanup Common Crawl data
☆1,045 · Apr 25, 2023 · Updated 3 years ago
bigscience-workshop / Megatron-DeepSpeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
☆1,448 · Mar 20, 2024 · Updated 2 years ago
huggingface / olm-datasets
Pipeline for pulling and processing online language model pretraining data from the web
☆179 · Jul 31, 2023 · Updated 2 years ago
bigscience-workshop / bigscience
Central place for the engineering/scaling WG: documentation, SLURM scripts and logs, compute environment and data.
☆1,018 · Jul 29, 2024 · Updated last year
shjwudp / c4-dataset-script
Inspired by google c4, here is a series of colossal clean data cleaning scripts focused on CommonCrawl data processing. Including Chinese…
☆136 · Jun 7, 2023 · Updated 3 years ago
bigscience-workshop / metadata
Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
☆29 · Jun 12, 2023 · Updated 3 years ago
huggingface / datablations
Scaling Data-Constrained Language Models
☆344 · Jun 28, 2025 · Updated last year
togethercomputer / RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
☆4,969 · Jun 3, 2026 · Updated last month
allenai / dolma
Data and tools for generating and inspecting OLMo pre-training data.
☆1,526 · Nov 5, 2025 · Updated 8 months ago
bigcode-project / bigcode-analysis
Repository for analysis and experiments in the BigCode project.
☆126 · Mar 20, 2024 · Updated 2 years ago
bigscience-workshop / promptsource
Toolkit for creating, sharing and using natural language prompts.
☆3,027 · Oct 23, 2023 · Updated 2 years ago
EleutherAI / the-pile
☆1,670 · Apr 27, 2023 · Updated 3 years ago
NVIDIA / Megatron-LM
Ongoing research training transformer models at scale
☆17,108 · Updated this week
p-lambda / dsir
DSIR large-scale data selection framework for language model training
☆275 · Apr 7, 2024 · Updated 2 years ago
EleutherAI / lm-evaluation-harness
A framework for few-shot evaluation of language models.
☆13,336 · Jul 13, 2026 · Updated last week
huggingface / datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆3,210 · Updated this week
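esbatmop / MNBVC
MNBVC (Massive Never-ending BT Vast Chinese corpus): a very large-scale Chinese corpus, aiming to match the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also various niche subcultures and even "Martian script" (火星文) data. The MNBVC dataset includes news, essays, novels, books, magazines…
☆4,245 · Jul 13, 2026 · Updated last week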
yizhongw / self-instruct
Aligning pretrained language models with instruction data generated by themselves.
☆4,606 · Mar 27, 2023 · Updated 3 years ago
deepspeedai / Megatron-DeepSpeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
☆2,256 · Aug 14, 2025 · Updated 11 months ago
ekzhu / datasketch
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
☆2,944 · Jul 5, 2026 · Updated 2 weeks ago
CarperAI / trlx
A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)
☆4,752 · Jan 8, 2024 · Updated 2 years ago
locuslab / scaling_laws_data_filtering
☆64 · Apr 9, 2024 · Updated 2 years ago
EleutherAI / oslo
OSLO: Open Source for Large-scale Optimization
☆175 · Sep 9, 2023 · Updated 2 years ago
bigscience-workshop / multilingual-modeling
BLOOM+1: Adapting BLOOM model to support a new unseen language
☆74 · Mar 2, 2024 · Updated 2 years ago
bigcode-project / bigcode-tokenizer
☆15 · Oct 24, 2023 · Updated 2 years ago
FranxYao / chain-of-thought-hub
Benchmarking large language models' complex reasoning ability with chain-of-thought prompting
☆2,777 · Aug 4, 2024 · Updated last year
anthropics / hh-rlhf
Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"
☆1,851 · Jun 17, 2025 · Updated last year
leogao2 / lm_dataformat
☆79 · Dec 7, 2023 · Updated 2 years ago
databricks / megablocks
☆1,582 · Mar 25, 2026 · Updated 3 months ago
PhoebusSi / Alpaca-CoT
We unified the interfaces of instruction-tuning data (e.g., CoT data), multiple LLMs and parameter-efficient methods (e.g., lora, p-tunin…
☆2,791 · Dec 12, 2023 · Updated 2 years ago
chatnoir-eu / chatnoir-resiliparse
A robust web archive analytics toolkit
☆144 · Jun 16, 2026 · Updated last month
sangmichaelxie / doremi
Pytorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
☆357 · Dec 26, 2023 · Updated 2 years ago
huggingface / trl
Train transformer language models with reinforcement learning.
☆18,878 · Updated this week
huggingface / nanotron
Minimalistic large language model 3D-parallelism training
☆2,754 · May 26, 2026 · Updated last month