p-lambda/dsir

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/p-lambda/dsir)

p-lambda / dsir

DSIR large-scale data selection framework for language model training

☆275

Alternatives and similar repositories for dsir

Users that are interested in dsir are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

princeton-nlp / QuRating
View on GitHub
[ICML 2024] Selecting High-Quality Data for Training Language Models
☆204Dec 8, 2025Updated 7 months ago
sangmichaelxie / doremi
View on GitHub
Pytorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
☆357Dec 26, 2023Updated 2 years ago
princeton-nlp / LESS
View on GitHub
[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning
☆531Oct 20, 2024Updated last year
cxcscmu / MATES
View on GitHub
Official repository for MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [NeurIPS 2024]
☆80Nov 14, 2024Updated last year
huggingface / datatrove
View on GitHub
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆3,214Updated this week
AI Agents on DigitalOcean Gradient AI Platform • Ad
Build production-ready AI agents using customizable tools or access multiple LLMs through a single endpoint. Create custom knowledge bases or connect external data.
alon-albalak / online-data-mixing
View on GitHub
An implementation of online data mixing for the Pile dataset, based on the GPT-NeoX library.
☆14Jan 9, 2024Updated 2 years ago
facebookresearch / SemDeDup
View on GitHub
Code for "SemDeDup", a simple method for identifying and removing semantic duplicates from a dataset (data pairs which are semantically s…
☆156Oct 1, 2023Updated 2 years ago
hkust-nlp / deita
View on GitHub
Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
☆599Dec 9, 2024Updated last year
google-research / deduplicate-text-datasets
View on GitHub
☆1,270Jul 30, 2024Updated last year
allenai / olmix
View on GitHub
☆41May 26, 2026Updated last month
alon-albalak / data-selection-survey
View on GitHub
A Survey on Data Selection for Language Models
☆260Apr 29, 2025Updated last year
princeton-nlp / LLM-Shearing
View on GitHub
[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
☆640Mar 4, 2024Updated 2 years ago
feiyang-k / AutoScale
View on GitHub
Official Code Repository for [AutoScale📈: Scale-Aware Data Mixing for Pre-Training LLMs] Published as a conference paper at **COLM 2025*…
☆14Aug 8, 2025Updated 11 months ago
huggingface / cosmopedia
View on GitHub
☆572Nov 20, 2024Updated last year
Managed Database hosting by DigitalOcean • Ad
PostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
swj0419 / detect-pretrain-code
View on GitHub
This repository provides an original implementation of Detecting Pretraining Data from Large Language Models by *Weijia Shi, *Anirudh Aji…
☆243Nov 3, 2023Updated 2 years ago
CodeCreator / WebOrganizer
View on GitHub
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
☆83May 2, 2025Updated last year
MadryLab / datamodels-data
View on GitHub
Data for "Datamodels: Predicting Predictions with Training Data"
☆97May 25, 2023Updated 3 years ago
ZigeW / data_management_LLM
View on GitHub
Collection of training data management explorations for large language models
☆342Aug 2, 2024Updated last year
itayle / diverse-demonstrations
View on GitHub
Diverse Demonstrations Improve In-context Compositional Generalization
☆12Jul 7, 2023Updated 3 years ago
tianyi-lab / Cherry_LLM
View on GitHub
[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other mo…
☆416Jun 25, 2025Updated last year
ZifanL / TSDS
View on GitHub
Implementation of TSDS: Data Selection for Task-Specific Model Finetuning. An optimal-transport framework for selecting domain-specific a…
☆19Dec 25, 2024Updated last year
allenai / easy-to-hard-generalization
View on GitHub
Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data"
☆48Jan 17, 2024Updated 2 years ago
allenai / dolma
View on GitHub
Data and tools for generating and inspecting OLMo pre-training data.
☆1,526Nov 5, 2025Updated 8 months ago
Deploy open-source AI quickly and easily - Special Bonus Offer • Ad
Runpod Hub is built for open source. One-click deployment and autoscaling endpoints without provisioning your own infrastructure.
swj0419 / in-context-pretraining
View on GitHub
☆57Apr 11, 2024Updated 2 years ago
MadryLab / DsDm
View on GitHub
☆53Jan 24, 2024Updated 2 years ago
tianyi-lab / Superfiltering
View on GitHub
[ACL'24] Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning
☆189Jun 25, 2025Updated last year
huggingface / datablations
View on GitHub
Scaling Data-Constrained Language Models
☆344Jun 28, 2025Updated last year
applicaai / CCpdf
View on GitHub
Index of URLs to pdf files all over the internet and scripts
☆25May 2, 2023Updated 3 years ago
chatnoir-eu / chatnoir-resiliparse
View on GitHub
A robust web archive analytics toolkit
☆144Updated this week
mlfoundations / dclm
View on GitHub
DataComp for Language Models
☆1,454Sep 9, 2025Updated 10 months ago
logix-project / logix
View on GitHub
AI Logging for Interpretability and Explainability🔬
☆147Jun 7, 2024Updated 2 years ago
huggingface / nanotron
View on GitHub
Minimalistic large language model 3D-parallelism training
☆2,755May 26, 2026Updated last month
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
locuslab / scaling_laws_data_filtering
View on GitHub
☆64Apr 9, 2024Updated 2 years ago
meta-math / MetaMath
View on GitHub
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
☆455Feb 1, 2024Updated 2 years ago
microsoft / rho
View on GitHub
Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs.
☆470Apr 18, 2024Updated 2 years ago
mlfoundations / scaling
View on GitHub
Language models scale reliably with over-training and on downstream tasks
☆102Apr 2, 2024Updated 2 years ago
bigscience-workshop / data-preparation
View on GitHub
Code used for sourcing and cleaning the BigScience ROOTS corpus
☆318Mar 20, 2023Updated 3 years ago
IBM / ModuleFormer
View on GitHub
ModuleFormer is a MoE-based architecture that includes two different types of experts: stick-breaking attention heads and feedforward exp…
☆225Sep 18, 2025Updated 10 months ago
MadryLab / trak
View on GitHub
A fast, effective data attribution method for neural networks in PyTorch
☆243Nov 18, 2024Updated last year