SJTU-DMTai/awesome-ml-data-quality-papers

SJTU-DMTai / awesome-ml-data-quality-papers

Papers about training data quality management for ML models.

☆126

Alternatives and similar repositories for awesome-ml-data-quality-papers

Users who are interested in awesome-ml-data-quality-papers are comparing it to the repositories listed below.

reds-lab / projektor
This is an official repository for "Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources" (…
☆14 • Oct 26, 2023 • Updated 2 years ago
cxcscmu / MATES
Official repository for MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models [NeurIPS 2024]
☆80 • Nov 14, 2024 • Updated last year
hrtan / MoSo
[NeurIPS-2023] The PyTorch Implementation of MoSo. The algorithms are based on our paper: "Data Pruning via Moving-one-Sample-out". MoSo …
☆10 • May 21, 2026 • Updated 2 months ago
ykwon0407 / DataInf
DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models (ICLR 2024)
☆82 • Oct 3, 2024 • Updated last year
daviddao / awesome-data-valuation
💱 A curated list of data valuation (DV) to design your next data marketplace
☆143 • Feb 20, 2025 • Updated last year
ZaydH / influence_analysis_papers
Influence Analysis and Estimation - Survey, Papers, and Taxonomy
☆90 • Feb 27, 2024 • Updated 2 years ago
alon-albalak / data-selection-survey
A Survey on Data Selection for Language Models
☆261 • Apr 29, 2025 • Updated last year
anshuman23 / InfDataSel
Code for paper: “What Data Benefits My Classifier?” Enhancing Model Performance and Interpretability through Influence-Based Data Selecti…
☆23 • May 17, 2024 • Updated 2 years ago
adymaharana / d2pruning
☆44 • Oct 13, 2023 • Updated 2 years ago
SJTU-DMTai / Data-Management-for-GNN-Training
☆11 • Sep 6, 2024 • Updated last year
JLDeng / SSCNN
[NeurIPS 2024 Spotlight] Official Code of the paper "Parsimony or Capability? Decomposition Delivers Both in Long-term Time Series Foreca…
☆16 • Dec 24, 2024 • Updated last year
simplelifetime / TIVE
Less is More: High-value Data Selection for Visual Instruction Tuning
☆20 • Jan 18, 2025 • Updated last year
reds-lab / LAVA
This is an official repository for "LAVA: Data Valuation without Pre-Specified Learning Algorithms" (ICLR2023).
☆54 • Jun 5, 2024 • Updated 2 years ago
MorphingDB / MorphingDB
PostgreSQL extension for supporting deep learning model inference within the database and vector storage
☆63 • Sep 29, 2025 • Updated 10 months ago
TRAIS-Lab / dattri
`dattri` is a PyTorch library for developing, benchmarking, and deploying efficient data attribution algorithms.
☆124 • Mar 24, 2026 • Updated 4 months ago
rgeirhos / dataset-pruning-metrics
Metrics for "Beyond neural scaling laws: beating power law scaling via data pruning" (NeurIPS 2022 Outstanding Paper Award)
☆58 • Apr 24, 2023 • Updated 3 years ago
ZifanL / TSDS
Implementation of TSDS: Data Selection for Task-Specific Model Finetuning. An optimal-transport framework for selecting domain-specific a…
☆19 • Dec 25, 2024 • Updated last year
SJTU-DMTai / SUNNY-GNN
The official implementation of AAAI'24 paper: Self-Interpretable Graph Learning with Sufficient and Necessary Explanations.
☆16 • Jan 29, 2024 • Updated 2 years ago
XinyiYS / Gradient-Driven-Rewards-to-Guarantee-Fairness-in-Collaborative-Machine-Learning
Official code repository for our accepted work "Gradient Driven Rewards to Guarantee Fairness in Collaborative Machine Learning" in NeurI…
☆28 • Sep 28, 2024 • Updated last year
opendataval / opendataval
OpenDataVal: a Unified Benchmark for Data Valuation in Python (NeurIPS 2023)
☆101 • Feb 4, 2025 • Updated last year
zhangxin-xd / Dataset-Pruning-TDDS
The official implementation of paper "Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning" (CVPR …
☆21 • Aug 20, 2024 • Updated last year
OPTML-Group / DP4TL
[NeurIPS2023] "Selectivity Drives Productivity: Efficient Dataset Pruning for Enhanced Transfer Learning" by Yihua Zhang*, Yimeng Zhang*,…
☆14 • Oct 12, 2023 • Updated 2 years ago
ekinakyurek / influence
Code for "Tracing Knowledge in Language Models Back to the Training Data"
☆40 • Dec 27, 2022 • Updated 3 years ago
logix-project / logix
AI Logging for Interpretability and Explainability🔬
☆147 • Jun 7, 2024 • Updated 2 years ago
MadryLab / DsDm
☆53 • Jan 24, 2024 • Updated 2 years ago
TianyuFan0504 / awesome-spatio-temporal-graph
This repository contains a list of papers on spatio-temporal graph, especially about GNNs on S-T graph.
☆18 • Sep 8, 2023 • Updated 2 years ago
sungyubkim / gex
Official code implementation of "GEX: A flexible method for approximating influence via Geometric Ensemble" (NeurIPS 2023)
☆14 • Jan 3, 2024 • Updated 2 years ago
NUS-HPC-AI-Lab / DD-Ranking
Data distillation benchmark
☆73 • Jun 13, 2025 • Updated last year
hanshen95 / SEAL
An implementation of SEAL: Safety-Enhanced Aligned LLM fine-tuning via bilevel data selection.
☆24 • Feb 20, 2025 • Updated last year
CodeCreator / WebOrganizer
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
☆83 • May 2, 2025 • Updated last year
daeveraert / gradient-information-optimization
Implementation of Gradient Information Optimization (GIO) for effective and scalable training data selection
☆14 • Jun 22, 2023 • Updated 3 years ago
yidingjiang / ado
The repository contains code for Adaptive Data Optimization
☆37 • Dec 9, 2024 • Updated last year
zepingyu0512 / neuron-attribution
Code for EMNLP 2024 paper: Neuron-Level Knowledge Attribution in Large Language Models
☆52 • Nov 17, 2024 • Updated last year
p-lambda / dsir
DSIR large-scale data selection framework for language model training
☆275 • Apr 7, 2024 • Updated 2 years ago
princeton-nlp / QuRating
[ICML 2024] Selecting High-Quality Data for Training Language Models
☆204 • Dec 8, 2025 • Updated 7 months ago
google-research / jax-influence
☆66 • Jan 13, 2022 • Updated 4 years ago
jjbrophy47 / instance_based_interpretability
Existing literature about training-data analysis.
☆17 • Dec 17, 2021 • Updated 4 years ago
BAAI-DCAI / Dataset-Pruning
Dataset pruning for ImageNet and LAION-2B.
☆80 • Jul 5, 2024 • Updated 2 years ago
sail-sg / regmix
[ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)
☆195 • Feb 17, 2025 • Updated last year