david-siqi-liu / sparklyclean
Optimal distributed data deduplication and supervised learning pipeline using Apache Spark
☆10 Updated 4 years ago
Alternatives and similar repositories for sparklyclean:
Users who are interested in sparklyclean are comparing it to the libraries listed below.
- ☆15 Updated 2 years ago
- Explaining Inference Queries with Bayesian Optimization ☆10 Updated 4 years ago
- JedAI-WebApp is a GUI that facilitates the execution of JedAI. JedAI is an open source, high scalability toolkit that offers out-of-the-b… ☆23 Updated 2 years ago
- SparkER: an Entity Resolution framework for Apache Spark ☆64 Updated last year
- Condor allows for the specification of synopsis-based streaming jobs on top of general dataflow systems. Condor provides a collection of … ☆13 Updated 10 months ago
- An open source, high scalability toolkit in Java for Entity Resolution. ☆218 Updated last year
- Collection of some algorithms for entity resolution ☆28 Updated 9 years ago
- deep entity resolution lite version ☆11 Updated 5 years ago
- ☆16 Updated 8 years ago
- Source code for several Metanome data profiling algorithms ☆53 Updated last year
- Record Linkage ToolKit (Find and link entities) ☆110 Updated last year
- This project focuses on DeepER, a deep learning framework for entity resolution (record deduplication). It examines how DeepER performs o… ☆46 Updated 6 years ago
- Distributed Bayesian Entity Resolution in Apache Spark ☆57 Updated 3 years ago
- UI for JedAI Toolkit ☆17 Updated 2 years ago
- T2K Match is a matching algorithm optimised to match millions of web tables to a central knowledge base. ☆21 Updated 6 years ago
- LSHDB is a parallel and distributed data engine, which relies on Locality-Sensitive Hashing and noSQL systems, for performing record link… ☆31 Updated 2 years ago
- Distributed similarity search ☆9 Updated 5 years ago
- A tool facilitating matching for any dataset discovery method. Also, an extensible experiment suite for state-of-the-art schema matching … ☆88 Updated 3 weeks ago
- Age classification from text using PAN16, blogs, Fisher Callhome, and Cancer Forum ☆17 Updated 2 years ago
- ☆77 Updated 2 years ago
- ☆32 Updated 3 years ago
- Rheem - a cross-platform data processing system ☆5 Updated 3 years ago
- WInte.r is a Java framework for end-to-end data integration. The WInte.r framework implements well-known methods for data pre-processing,… ☆110 Updated 2 years ago
- FlexMatcher is a schema matching package in Python which handles the problem of matching multiple schemas to a single mediated schema. ☆29 Updated 4 months ago
- A Machine Learning System for Data Enrichment. ☆75 Updated 6 years ago
- Stanford Entity-Resolution Framework ☆23 Updated 6 years ago
- Real-time query spark and visualise it as graph. ☆24 Updated 7 years ago
- A Generalized Data Cleaning System ☆49 Updated 8 years ago
- Code to extract functional dependencies (FDs) and conditional functional dependencies (CFDs) from data ☆36 Updated 4 years ago
- Sketch and LSH Index library for Java, including OPH methods as well as the Lazo method ☆13 Updated last year