mlfoundations/scaling

Language models scale reliably with over-training and on downstream tasks

☆102

Alternatives and similar repositories for scaling

Users who are interested in scaling are comparing it to the libraries listed below.

mlfoundations / open_lm
A repository for research on medium sized language models.
☆537 · Jun 6, 2025 · Updated last year
allenai / easy-to-hard-generalization
Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data"
☆48 · Jan 17, 2024 · Updated 2 years ago
huggingface / datablations
Scaling Data-Constrained Language Models
☆344 · Jun 28, 2025 · Updated last year
formll / resolving-scaling-law-discrepancies
☆19 · Nov 4, 2025 · Updated 8 months ago
PiotrNawrot / nano-sparse-attention
The simplest implementation of recent Sparse Attention patterns for efficient LLM inference.
☆92 · Jul 17, 2025 · Updated last year
frankxu2004 / knnlm-why
Repo for ICML23 "Why do Nearest Neighbor Language Models Work?"
☆59 · Jan 12, 2023 · Updated 3 years ago
ryoungj / ObsScaling
[NeurIPS'24 Spotlight] Observational Scaling Laws
☆60 · Oct 2, 2024 · Updated last year
GAIR-NLP / ProX
[ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
☆271 · Jul 8, 2025 · Updated last year
EleutherAI / pile_dedupe
Pile Deduplication Code
☆18 · May 15, 2023 · Updated 3 years ago
allenai / wimbd
What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets
☆229 · Nov 16, 2024 · Updated last year
CodeCreator / WebOrganizer
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
☆83 · May 2, 2025 · Updated last year
Zyphra / Zyda_processing
☆44 · Jun 19, 2024 · Updated 2 years ago
mlfoundations / dclm
DataComp for Language Models
☆1,455 · Sep 9, 2025 · Updated 10 months ago
crowsonkb / torch-dist-utils
Utilities for PyTorch distributed
☆26 · Feb 27, 2025 · Updated last year
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆281 · Oct 3, 2025 · Updated 9 months ago
sail-sg / scaling-with-vocab
[NeurIPS-2024] 📈 Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies https://arxiv.org/abs/2407.13623
☆112 · Sep 26, 2024 · Updated last year
ethancaballero / broken_neural_scaling_laws
Code Release for "Broken Neural Scaling Laws" (BNSL) paper
☆59 · Oct 29, 2023 · Updated 2 years ago
allenai / catwalk
This project studies the performance and robustness of language models and task-adaptation methods.
☆154 · May 18, 2024 · Updated 2 years ago
nkandpa2 / long_tail_knowledge
Repo for the paper "Large Language Models Struggle to Learn Long-Tail Knowledge"
☆77 · Apr 12, 2023 · Updated 3 years ago
sail-sg / ActivePRM
☆21 · Apr 16, 2025 · Updated last year
hkust-nlp / llm-compression-intelligence
Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024]
☆150 · Sep 20, 2024 · Updated last year
RUCAIBox / JiuZhang3.0
The code and data for the paper JiuZhang3.0
☆49 · May 26, 2024 · Updated 2 years ago
allenai / dolma
Data and tools for generating and inspecting OLMo pre-training data.
☆1,527 · Nov 5, 2025 · Updated 8 months ago
UriSha / EmbeddinglessNMT
The implementation of "Neural Machine Translation without Embeddings", NAACL 2021
☆33 · Jun 9, 2021 · Updated 5 years ago
cloneofsimo / ezmup
Simple implementation of muP, based on Spectral Condition for Feature Learning. The implementation is SGD only; don't use it for Adam.
☆88 · Jul 28, 2024 · Updated last year
sail-sg / regmix
[ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)
☆194 · Feb 17, 2025 · Updated last year
protagolabs / odyssey-math
☆84 · Jan 25, 2025 · Updated last year
llm-random / llm-random
☆212 · Jun 17, 2026 · Updated last month
bigscience-workshop / multilingual-modeling
BLOOM+1: Adapting BLOOM model to support a new unseen language
☆75 · Mar 2, 2024 · Updated 2 years ago
ryanzhumich / sparc_atis_pytorch
☆10 · Oct 28, 2019 · Updated 6 years ago
EleutherAI / improved-t5
Experiments for efforts to train a new and improved t5
☆76 · Apr 15, 2024 · Updated 2 years ago
alon-albalak / data-selection-survey
A Survey on Data Selection for Language Models
☆261 · Apr 29, 2025 · Updated last year
allenai / gpv2-web10k
Download Web-10K data by querying Bing Image Search
☆10 · Feb 1, 2022 · Updated 4 years ago
princeton-nlp / QuRating
[ICML 2024] Selecting High-Quality Data for Training Language Models
☆204 · Dec 8, 2025 · Updated 7 months ago
yifanzhang-pro / AutoMathText
[ACL 2025 Findings] Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts (https://huggingface.co/papers…
☆92 · Nov 23, 2025 · Updated 8 months ago
allenai / mmc4
MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.
☆953 · Mar 19, 2025 · Updated last year
ggjy / vision_weak_to_strong
☆38 · Feb 8, 2024 · Updated 2 years ago
lm-sys / llm-decontaminator
Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"
☆325 · Dec 20, 2023 · Updated 2 years ago
vwxyzjn / lm-human-preference-details
RLHF implementation details of OAI's 2019 codebase
☆198 · Jan 14, 2024 · Updated 2 years ago