Scaling Data-Constrained Language Models
☆342 · Jun 28, 2025 · Updated 8 months ago
Alternatives and similar repositories for datablations
Users who are interested in datablations are comparing it to the repositories listed below.
- Language models scale reliably with over-training and on downstream tasks ☆100 · Apr 2, 2024 · Updated last year
- Minimalistic large language model 3D-parallelism training ☆2,569 · Feb 19, 2026 · Updated last week
- Pytorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets ☆350 · Dec 26, 2023 · Updated 2 years ago
- Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024] ☆147 · Sep 20, 2024 · Updated last year
- Data and tools for generating and inspecting OLMo pre-training data. ☆1,411 · Nov 5, 2025 · Updated 3 months ago
- Pile Deduplication Code ☆18 · May 15, 2023 · Updated 2 years ago
- ☆109 · Jul 15, 2025 · Updated 7 months ago
- A repository for research on medium sized language models. ☆533 · Jun 6, 2025 · Updated 8 months ago
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. ☆2,903 · Updated this week
- Implementation of paper Data Engineering for Scaling Language Models to 128K Context ☆486 · Mar 19, 2024 · Updated last year
- ☆565 · Nov 20, 2024 · Updated last year
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends ☆2,311 · Feb 20, 2026 · Updated last week
- Robust recipes to align language models with human and AI preferences ☆5,506 · Sep 8, 2025 · Updated 5 months ago
- Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M d… ☆211 · Aug 28, 2024 · Updated last year
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ☆1,315 · Mar 6, 2025 · Updated 11 months ago
- Code for the paper "Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns" ☆18 · Mar 15, 2024 · Updated last year
- Hugging Face and Pyserini interoperability ☆19 · May 18, 2023 · Updated 2 years ago
- An open collection of implementation tips, tricks and resources for training large language models ☆497 · Mar 8, 2023 · Updated 2 years ago
- Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples" ☆316 · Dec 20, 2023 · Updated 2 years ago
- Benchmarking large language models' complex reasoning ability with chain-of-thought prompting ☆2,766 · Aug 4, 2024 · Updated last year
- Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs. ☆459 · Apr 18, 2024 · Updated last year
- 800,000 step-level correctness labels on LLM solutions to MATH problems ☆2,091 · Jun 1, 2023 · Updated 2 years ago
- Skill-It! A Data-Driven Skills Framework for Understanding and Training Language Models ☆48 · Oct 31, 2023 · Updated 2 years ago
- A family of open-sourced Mixture-of-Experts (MoE) Large Language Models ☆1,660 · Mar 8, 2024 · Updated last year
- Salesforce open-source LLMs with 8k sequence length. ☆725 · Jan 31, 2025 · Updated last year
- [NeurIPS 2023] MeZO: Fine-Tuning Language Models with Just Forward Passes. https://arxiv.org/abs/2305.17333 ☆1,149 · Jan 11, 2024 · Updated 2 years ago
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data, but it should work with any hugging face text dataset. ☆96 · Feb 9, 2023 · Updated 3 years ago
- Fine-Tuning Pre-trained Transformers into Decaying Fast Weights ☆19 · Oct 9, 2022 · Updated 3 years ago
- DataComp for Language Models ☆1,419 · Sep 9, 2025 · Updated 5 months ago
- Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" ☆1,816 · Jun 17, 2025 · Updated 8 months ago
- ☆1,560 · Feb 20, 2026 · Updated last week
- The RedPajama-Data repository contains code for preparing large datasets for training large language models. ☆4,923 · Dec 7, 2024 · Updated last year
- ☆2,549 · May 19, 2024 · Updated last year
- maximal update parametrization (µP) ☆1,686 · Jul 17, 2024 · Updated last year
- A simulation framework for RLHF and alternatives. Develop your RLHF method without collecting human data. ☆842 · Jul 1, 2024 · Updated last year
- [EMNLP 2022] Training Language Models with Memory Augmentation https://arxiv.org/abs/2205.12674 ☆195 · Jun 14, 2023 · Updated 2 years ago
- ☆99 · Jul 25, 2023 · Updated 2 years ago
- The hub for EleutherAI's work on interpretability and learning dynamics ☆2,740 · Nov 15, 2025 · Updated 3 months ago
- AllenAI's post-training codebase ☆3,592 · Updated this week