sangmichaelxie/doremi

PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets

☆357

Alternatives and similar repositories for doremi

Users who are interested in doremi are comparing it to the libraries listed below.

p-lambda / dsir
DSIR large-scale data selection framework for language model training
☆275 · Apr 7, 2024 · Updated 2 years ago
princeton-nlp / QuRating
[ICML 2024] Selecting High-Quality Data for Training Language Models
☆204 · Dec 8, 2025 · Updated 7 months ago
huggingface / datablations
Scaling Data-Constrained Language Models
☆344 · Jun 28, 2025 · Updated last year
yegcjs / mixinglaws
☆113 · Jul 15, 2025 · Updated last year
Olivia-fsm / DoGE
Codebase for ICML submission "DOGE: Domain Reweighting with Generalization Estimation"
☆21 · Feb 29, 2024 · Updated 2 years ago
sail-sg / regmix
[ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)
☆194 · Feb 17, 2025 · Updated last year
togethercomputer / RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
☆4,972 · Jun 3, 2026 · Updated last month
shadowkiller33 / Contrast-Instruction
☆19 · Oct 2, 2023 · Updated 2 years ago
huggingface / datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆3,223 · Updated this week
huggingface / cosmopedia
☆572 · Nov 20, 2024 · Updated last year
lm-sys / llm-decontaminator
Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"
☆325 · Dec 20, 2023 · Updated 2 years ago
huggingface / alignment-handbook
Robust recipes to align language models with human and AI preferences
☆5,645 · May 26, 2026 · Updated 2 months ago
ChenghaoMou / text-dedup
All-in-one text de-duplication
☆764 · Mar 9, 2026 · Updated 4 months ago
HazyResearch / skill-it
Skill-It! A Data-Driven Skills Framework for Understanding and Training Language Models
☆48 · Oct 31, 2023 · Updated 2 years ago
princeton-nlp / LLM-Shearing
[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
☆643 · Mar 4, 2024 · Updated 2 years ago
hkust-nlp / deita
Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
☆600 · Dec 9, 2024 · Updated last year
IBM / SALMON
Self-Alignment with Principle-Following Reward Models
☆170 · Sep 18, 2025 · Updated 10 months ago
XueFuzhao / OpenMoE
A family of open-sourced Mixture-of-Experts (MoE) Large Language Models
☆1,691 · Mar 8, 2024 · Updated 2 years ago
princeton-nlp / LESS
[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning
☆532 · Oct 20, 2024 · Updated last year
daeveraert / gradient-information-optimization
Implementation of Gradient Information Optimization (GIO) for effective and scalable training data selection
☆14 · Jun 22, 2023 · Updated 3 years ago
FranxYao / FlanT5-CoT-Specialization
Implementation of ICML 23 Paper: Specializing Smaller Language Models towards Multi-Step Reasoning.
☆131 · Jun 18, 2023 · Updated 3 years ago
RulinShao / LightSeq
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆223 · Aug 19, 2024 · Updated last year
MadryLab / DsDm
☆53 · Jan 24, 2024 · Updated 2 years ago
google-research / deduplicate-text-datasets
☆1,269 · Jul 30, 2024 · Updated last year
cxcscmu / MATES
Official repository for MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [NeurIPS 2024]
☆80 · Nov 14, 2024 · Updated last year
mlfoundations / scaling
Language models scale reliably with over-training and on downstream tasks
☆102 · Apr 2, 2024 · Updated 2 years ago
CodeCreator / WebOrganizer
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
☆83 · May 2, 2025 · Updated last year
facebookresearch / cc_net
Tools to download and cleanup Common Crawl data
☆1,047 · Apr 25, 2023 · Updated 3 years ago
jquesnelle / yarn
YaRN: Efficient Context Window Extension of Large Language Models
☆1,740 · Apr 17, 2024 · Updated 2 years ago
iwiwi / epochraft
Checkpointable dataset utilities for foundation model training
☆32 · Jan 29, 2024 · Updated 2 years ago
huggingface / nanotron
Minimalistic large language model 3D-parallelism training
☆2,766 · May 26, 2026 · Updated 2 months ago
FranxYao / Long-Context-Data-Engineering
Implementation of paper Data Engineering for Scaling Language Models to 128K Context
☆502 · Mar 19, 2024 · Updated 2 years ago
da03 / criticize_text_generation
A method for evaluating the high-level coherence of machine-generated texts. Identifies high-level coherence issues in transformer-based …
☆12 · Mar 18, 2023 · Updated 3 years ago
leuchine / self_play_picard
Using self-play to augment multi-turn text-to-SQL datasets
☆12 · Oct 20, 2022 · Updated 3 years ago
commoncrawl / ia-web-commons
Web archiving utility library
☆11 · Updated this week
openai / prm800k
800,000 step-level correctness labels on LLM solutions to MATH problems
☆2,152 · Jun 1, 2023 · Updated 3 years ago
mlfoundations / dclm
DataComp for Language Models
☆1,455 · Sep 9, 2025 · Updated 10 months ago
CarperAI / trlx
A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)
☆4,753 · Jan 8, 2024 · Updated 2 years ago
MadryLab / trak
A fast, effective data attribution method for neural networks in PyTorch
☆243 · Nov 18, 2024 · Updated last year