chu-data-lab/CleanML

A Benchmark for Joint Data Cleaning and Machine Learning

☆50

Alternatives and similar repositories for CleanML

Users who are interested in CleanML are comparing it to the repositories listed below.

WelkinNi / Automatic-Data-Repair
☆15 · Mar 6, 2025 · Updated last year
mohamedyd / rein-benchmark
A comprehensive benchmark for data cleaning methods and their impact on ML models
☆16 · Jul 24, 2024 · Updated 2 years ago
sis-ethz / Picket
Picket is a system that safeguards against data corruptions during both training and deployment of machine learning models over tabular d…
☆14 · Nov 24, 2020 · Updated 5 years ago
clips / interpret_with_rules
Code for the paper "Rule induction for global explanation of trained models"
☆22 · Jul 25, 2024 · Updated 2 years ago
JunHao-Zhu / FusionQuery
[VLDB 2024] Source code for FusionQuery: On-demand Fusion Queries over Multi-source Heterogeneous Data
☆11 · Mar 11, 2025 · Updated last year
LaureBerti / Learn2Clean
Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning
☆54 · Jun 20, 2026 · Updated last month
umich-dbgroup / foofah
Foofah: programming-by-example data transformation program synthesizer
☆28 · Apr 23, 2018 · Updated 8 years ago
dbunibas / BART
The BART Project: Benchmarking Algorithms for (data) Repairing and Translation
☆43 · Nov 27, 2023 · Updated 2 years ago
j-r77 / cfddiscovery
☆11 · Oct 31, 2019 · Updated 6 years ago
maropu / spark-data-repair-plugin
Provide functionality to build statistical models to repair dirty tabular data in Spark
☆12 · Apr 21, 2023 · Updated 3 years ago
LiPengCS / Auto-Tables-Benchmark
☆14 · Aug 31, 2023 · Updated 2 years ago
stefan-grafberger / mlinspect
Inspect ML Pipelines in Python in the form of a DAG
☆70 · Feb 24, 2024 · Updated 2 years ago
data-centric-ai / dcbench
A benchmark of data-centric tasks from across the machine learning lifecycle.
☆72 · Jun 8, 2022 · Updated 4 years ago
schelterlabs / jenga
Jenga is an experimentation library that allows data science practitioners and researchers to study the effect of common data corruptio…
☆43 · Jun 21, 2023 · Updated 3 years ago
HazyResearch / fm_data_tasks
Foundation Models for Data Tasks
☆112 · May 15, 2023 · Updated 3 years ago
codocedo / tane
Implementation of TANE for experimental purposes
☆15 · Apr 29, 2022 · Updated 4 years ago
stefan-grafberger / mlwhatif
Data-Centric What-If Analysis for Native Machine Learning Pipelines
☆16 · Jun 14, 2023 · Updated 3 years ago
SolidLao / SQLBarber
SQLBarber is a system based on Large Language Models (LLMs) to generate customized and realistic SQL workloads.
☆15 · Apr 7, 2026 · Updated 3 months ago
BenchCouncil / BigVectorBench
[VLDB 2025] BigVectorBench advances vector database benchmarking by defining and evaluating the embedding performance of heterogeneous da…
☆33 · Jan 17, 2025 · Updated last year
HPI-Information-Systems / snowman
Welcome to Snowman App – a Data Matching Benchmark Platform.
☆38 · Feb 9, 2023 · Updated 3 years ago
taiduydinh / k-CMM
This project proposes an algorithm named k-CMM for Clustering Mixed Numeric and Categorical Data with Missing Values
☆15 · Jul 9, 2021 · Updated 5 years ago
gdb / apex
A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
☆10 · Aug 13, 2024 · Updated last year
Jason-cs18 / Awesome-AI-Systems
Resources for recent AI systems (deployment concerns, cost and accessibility). -- closed
☆12 · May 29, 2021 · Updated 5 years ago
DataResponsibly / FairPrep
FairPrep is a design and evaluation framework for fairness-enhancing interventions that treats data as a first-class citizen.
☆11 · Mar 24, 2023 · Updated 3 years ago
duruiting / Active-Learning
☆11 · Sep 23, 2020 · Updated 5 years ago
AnthonyHaozeZhu / Compiler
A simple compiler written for the Compiler Principles course at Nankai University
☆11 · Dec 29, 2022 · Updated 3 years ago
Este1le / hpo_nmt
Datasets for Hyperparameter Optimization of Neural Machine Translation
☆10 · Aug 19, 2024 · Updated last year
davidireland-iso / LeNSE
☆14 · Nov 26, 2022 · Updated 3 years ago
edervishaj / gan-mf-thesis
This is the repository for the Master of Science thesis titled "GAN-based Matrix Factorization for Recommender Systems".
☆10 · Aug 10, 2020 · Updated 5 years ago
MadryLab / datamodels-data
Data for "Datamodels: Predicting Predictions with Training Data"
☆97 · May 25, 2023 · Updated 3 years ago
jeffhj / open-relation-modeling
The implementation for "Open Relation Modeling: Learning to Define Relations between Entities" (Findings of ACL '22)
☆12 · Feb 28, 2022 · Updated 4 years ago
MadryLab / datamodels
☆32 · May 24, 2023 · Updated 3 years ago
mmahdavian / semantic_visual_teach_repeat
A robotics package implementing an algorithm for visual teach and repeat
☆15 · Jun 30, 2022 · Updated 4 years ago
VincentShenbw / similarityjoin
Implementation of many similarity join algorithms.
☆15 · Mar 6, 2014 · Updated 12 years ago
Smrati8 / Database-System-Implementation
Built a single-user database management system from scratch using C++ supporting some SQL & relational algebra operations
☆13 · Sep 24, 2020 · Updated 5 years ago
JoonyoungYi / LLORMA-tensorflow
The TensorFlow prototype of "Local Low-rank Matrix Approximation" (LLORMA)
☆10 · Jan 11, 2019 · Updated 7 years ago
hanmaxmax / Parallel-programming
Coursework for the Parallel Programming course at Nankai University (taught by 王刚, Wang Gang)
☆12 · Oct 5, 2022 · Updated 3 years ago
jp-sglab / Spherical_Hashing
☆15 · Dec 28, 2023 · Updated 2 years ago
delftdata / valentine
A tool facilitating matching columns across tabular datasets. It also serves as an experiment suite for state-of-the-art schema matching …
☆125 · May 15, 2026 · Updated 2 months ago