huggingface/cosmopedia

☆572

Alternatives and similar repositories for cosmopedia

Users who are interested in cosmopedia are comparing it to the libraries listed below.

huggingface / llm-swarm
Manage scalable open LLM inference endpoints in Slurm clusters
☆289 Jul 11, 2024 Updated 2 years ago
huggingface / datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆3,217 Updated this week
huggingface / text-clustering
Easily embed, cluster and semantically label text datasets
☆610 Mar 28, 2024 Updated 2 years ago
magpie-align / magpie
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data …
☆874 Mar 17, 2025 Updated last year
huggingface / nanotron
Minimalistic large language model 3D-parallelism training
☆2,760 May 26, 2026 Updated last month
huggingface / lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
☆2,488 Jun 29, 2026 Updated 3 weeks ago
argilla-io / distilabel
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi…
☆3,339 Updated this week
mlfoundations / dclm
DataComp for Language Models
☆1,454 Sep 9, 2025 Updated 10 months ago
allenai / dolma
Data and tools for generating and inspecting OLMo pre-training data.
☆1,526 Nov 5, 2025 Updated 8 months ago
huggingface / alignment-handbook
Robust recipes to align language models with human and AI preferences
☆5,639 May 26, 2026 Updated last month
microsoft / rho
Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs.
☆470 Apr 18, 2024 Updated 2 years ago
huggingface / datablations
Scaling Data-Constrained Language Models
☆344 Jun 28, 2025 Updated last year
TIGER-AI-Lab / MAmmoTH2
Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024]
☆146 Oct 27, 2024 Updated last year
multimodal-art-projection / MAP-NEO
☆985 Feb 7, 2025 Updated last year
arcee-ai / mergekit
Tools for merging pretrained large language models.
☆7,250 Jun 17, 2026 Updated last month
princeton-nlp / QuRating
[ICML 2024] Selecting High-Quality Data for Training Language Models
☆204 Dec 8, 2025 Updated 7 months ago
lm-sys / llm-decontaminator
Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"
☆324 Dec 20, 2023 Updated 2 years ago
hkust-nlp / deita
Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
☆599 Dec 9, 2024 Updated last year
NVIDIA / NeMo-Aligner
Scalable toolkit for efficient model alignment
☆850 Oct 6, 2025 Updated 9 months ago
VikParuchuri / textbook_quality
Generate textbook-quality synthetic LLM pretraining data
☆508 Oct 19, 2023 Updated 2 years ago
yifanzhang-pro / AutoMathText
[ACL 2025 Findings] Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts (https://huggingface.co/papers…
☆92 Nov 23, 2025 Updated 7 months ago
tencent-ailab / persona-hub
Official repo for the paper "Scaling Synthetic Data Creation with 1,000,000,000 Personas"
☆1,617 Feb 19, 2025 Updated last year
allenai / open-instruct
AllenAI's post-training codebase
☆3,803 Updated this week
sangmichaelxie / doremi
Pytorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
☆357 Dec 26, 2023 Updated 2 years ago
ChenghaoMou / text-dedup
All-in-one text de-duplication
☆764 Mar 9, 2026 Updated 4 months ago
GAIR-NLP / ProX
[ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
☆271 Jul 8, 2025 Updated last year
EleutherAI / lm-evaluation-harness
A framework for few-shot evaluation of language models.
☆13,359 Jul 13, 2026 Updated last week
google-research / deduplicate-text-datasets
☆1,270 Jul 30, 2024 Updated last year
huggingface / Math-Verify
☆1,170 Jan 10, 2026 Updated 6 months ago
p-lambda / dsir
DSIR large-scale data selection framework for language model training
☆275 Apr 7, 2024 Updated 2 years ago
ZitongYang / Synthetic_Continued_Pretraining
Code implementation of synthetic continued pretraining
☆162 Jan 6, 2025 Updated last year
allenai / wimbd
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
☆229 Nov 16, 2024 Updated last year
RUCAIBox / JiuZhang3.0
The code and data for the paper JiuZhang3.0
☆49 May 26, 2024 Updated 2 years ago
allenai / OLMo
Modeling, training, eval, and inference code for OLMo
☆6,600 Nov 24, 2025 Updated 7 months ago
yegcjs / mixinglaws
☆113 Jul 15, 2025 Updated last year
jzhang38 / EasyContext
Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware.
☆759 Sep 27, 2024 Updated last year
mlfoundations / open_lm
A repository for research on medium sized language models.
☆537 Jun 6, 2025 Updated last year
bigcode-project / bigcode-dataset
☆497 Aug 15, 2024 Updated last year
togethercomputer / RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
☆4,969 Jun 3, 2026 Updated last month