CodeCreator/WebOrganizer

Organize the Web: Constructing Domains Enhances Pre-Training Data Curation

☆83

Alternatives and similar repositories for WebOrganizer

Users who are interested in WebOrganizer are comparing it to the repositories listed below.

allenai / olmix
☆41 · May 26, 2026 · Updated last month
Jiachen-T-Wang / GREATS
☆20 · Jun 27, 2026 · Updated 3 weeks ago
cxcscmu / MATES
Official repository for MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [NeurIPS 2024]
☆80 · Nov 14, 2024 · Updated last year
HazyResearch / aioli
Aioli: A unified optimization framework for language model data mixing
☆33 · Jan 17, 2025 · Updated last year
hkust-nlp / PreSelect
[ICML 2025] Predictive Data Selection: The Data That Predicts Is the Data That Teaches
☆66 · Mar 4, 2025 · Updated last year
princeton-nlp / QuRating
[ICML 2024] Selecting High-Quality Data for Training Language Models
☆204 · Dec 8, 2025 · Updated 7 months ago
sail-sg / regmix
[ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)
☆194 · Feb 17, 2025 · Updated last year
allenai / datamap-rs
Data mapping framework for rust stuff
☆56 · Mar 25, 2026 · Updated 3 months ago
feiyang-k / AutoScale
Official Code Repository for [AutoScale📈: Scale-Aware Data Mixing for Pre-Training LLMs] Published as a conference paper at **COLM 2025*…
☆14 · Aug 8, 2025 · Updated 11 months ago
princeton-pli / MeCo
Code for ICML 25 paper "Metadata Conditioning Accelerates Language Model Pre-training (MeCo)"
☆51 · Jun 30, 2025 · Updated last year
allenai / olmes
Reproducible, flexible LLM evaluations
☆388 · Mar 24, 2026 · Updated 3 months ago
fjzzq2002 / WeightWatch
Official Repository of Paper "Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs"
☆15 · Sep 25, 2025 · Updated 9 months ago
GAIR-NLP / OctoThinker
Revisiting Mid-training in the Era of Reinforcement Learning Scaling
☆189 · Jul 23, 2025 · Updated 11 months ago
mlfoundations / dclm
DataComp for Language Models
☆1,454 · Sep 9, 2025 · Updated 10 months ago
sail-sg / SkyLadder
The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling
☆43 · Dec 29, 2025 · Updated 6 months ago
mlfoundations / scaling
Language models scale reliably with over-training and on downstream tasks
☆102 · Apr 2, 2024 · Updated 2 years ago
MadryLab / D3M
Debiasing Through Data Attribution
☆13 · May 23, 2024 · Updated 2 years ago
allenai / bff
☆39 · Apr 17, 2024 · Updated 2 years ago
chentong0 / copy-bench
CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation
☆14 · Aug 19, 2025 · Updated 11 months ago
mlfoundations / dataset2metadata
☆28 · Mar 21, 2024 · Updated 2 years ago
MadryLab / DsDm
☆53 · Jan 24, 2024 · Updated 2 years ago
kakao / kanana-2
☆23 · Jun 30, 2026 · Updated 3 weeks ago
allenai / wimbd
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
☆229 · Nov 16, 2024 · Updated last year
davanstrien / huggingface-tldr
Experimental tl;dr summaries for datasets on the Hugging Face Hub!
☆10 · Apr 4, 2024 · Updated 2 years ago
rioyokotalab / swallow-code-math
Ongoing research project for code&math LLMs
☆32 · Jul 4, 2025 · Updated last year
EleutherAI / nanoGPT-mup
The simplest, fastest repository for training/finetuning medium-sized GPTs.
☆199 · Jan 19, 2026 · Updated 6 months ago
hyunwoongko / stop-sequencer
Implementation of stop sequencer for Huggingface Transformers
☆16 · Jun 6, 2023 · Updated 3 years ago
alon-albalak / data-selection-survey
A Survey on Data Selection for Language Models
☆260 · Apr 29, 2025 · Updated last year
SalesforceAIResearch / PretrainRL-pipeline
An automated data pipeline scaling RL to pretraining levels
☆76 · Jun 2, 2026 · Updated last month
p-lambda / dsir
DSIR large-scale data selection framework for language model training
☆275 · Apr 7, 2024 · Updated 2 years ago
harishsg993010 / HawkinsRAG
☆20 · Feb 18, 2025 · Updated last year
allenai / duplodocus
Tooling for exact and MinHash deduplication of large-scale text datasets
☆90 · Mar 24, 2026 · Updated 3 months ago
MrZilinXiao / ProxyThinker
[ICLR 2026] Official Implementation of ProxyThinker: Test-Time Guidance through Small Visual Reasoners.
☆22 · Sep 24, 2025 · Updated 9 months ago
Joshuaclymer / GameBench
☆21 · Jun 27, 2024 · Updated 2 years ago
adymaharana / d2pruning
☆44 · Oct 13, 2023 · Updated 2 years ago
richardodliu / OpenCodeEval
☆52 · Mar 9, 2026 · Updated 4 months ago
hrtan / MoSo
[NeurIPS-2023] The PyTorch Implementation of MoSo. The algorithms are based on our paper: "Data Pruning via Moving-one-Sample-out". MoSo …
☆10 · May 21, 2026 · Updated 2 months ago
allenai / FlexOlmo
Code and training scripts for FlexOlmo
☆151 · Apr 20, 2026 · Updated 3 months ago
sangmichaelxie / doremi
Pytorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
☆357 · Dec 26, 2023 · Updated 2 years ago